Step 1 is always to make sure the needed packages are installed. The code below uses several packages that you will need to install in order to follow along: igraph, network, and sna.
Other packages will be mentioned along the way, as necessary.
#install.packages("igraph")
#install.packages("network")
#install.packages("sna")
#install.packages("ggraph")
#install.packages("visNetwork")
#install.packages("threejs")
#install.packages("networkD3")
#install.packages("ndtv")
Colors are important for SNA and for map making. Different strategies apply to each type of visual analysis.
#Colors in R plots
#install.packages('RColorBrewer')
library('RColorBrewer')
display.brewer.all()
display.brewer.pal(8, "Set3")
display.brewer.pal(8, "Spectral")
display.brewer.pal(8, "Blues")
Network layouts are simply algorithms that return coordinates for each node in a network. Each layout algorithm positions the nodes differently.
Below you will find a list of possible layouts we can use with igraph. We are going to stick with the most common layout, called Fruchterman-Reingold, but depending on the results/research question you might want to explore alternative layouts, as they may fit the analysis better. (An excellent guide to layouts is also available for the sna package.)
## [1] "layout_as_star"
## [1] "layout_components"
## [1] "layout_in_circle"
## [1] "layout_nicely"
## [1] "layout_on_grid"
## [1] "layout_on_sphere"
## [1] "layout_randomly"
## [1] "layout_with_dh"
## [1] "layout_with_drl"
## [1] "layout_with_fr"
## [1] "layout_with_gem"
## [1] "layout_with_graphopt"
## [1] "layout_with_kk"
## [1] "layout_with_lgl"
## [1] "layout_with_mds"
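Under the hood, each of these layout functions simply returns a two-column matrix of x/y coordinates, one row per node. A minimal sketch on a toy ring graph (not our course data):

```r
library(igraph)

g <- make_ring(10)              # a toy graph: 10 nodes in a ring

coords <- layout_with_fr(g)     # Fruchterman-Reingold coordinates
dim(coords)                     # 10 rows (one per node), 2 columns (x, y)

# Reusing the same coordinate matrix keeps successive plots comparable:
plot(g, layout = coords)
```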
To begin, we will work with a data set containing information about (1) UCCS employees and (2) their connections with students. The first is a network of contacts, recording the number and types of contacts (in person and virtual) between different categories of UCCS employees. The second is a network of links between the employees and UCCS students.
The ideas behind the visualizations will apply to any dataset. However, certain visual properties, such as node attributes, are impossible to distinguish in larger graphs. In fact, when drawing very big networks you may want to hide the network edges and focus on identifying and visualizing communities of nodes, conduct multimodal analyses (like the facets we saw before using ggplot), or provide charts that show key characteristics of the network graph.
Note: there is a component of SNA that uses the APIs of various websites/apps (e.g., Twitter).
The first data set we are going to work with consists of two files, “Dataset1-UCCS-Example-NODES.csv” and “Dataset1-UCCS-Example-EDGES.csv”.
Load the data below. The data are stored in the “nodes” and “links” objects (Note: these names are used for clarity; you can name these objects anything).
nodes <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset1-UCCS-NODES.csv", header=T, as.is=T)
links <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset1-UCCS-Example-EDGES.csv", header=T, as.is=T)
Examine the data below and make sure it loaded properly into RStudio. Click on the nodes dataset and the links dataset and examine the structure.
head(nodes)
## id name level.type type.label years.worked
## 1 s01 Jon 1 Professor 10
## 2 s02 Anna 1 Professor 5
## 3 s03 Henriikka 1 Professor 7
## 4 s04 Gia 1 Professor 9
## 5 s05 Rich 1 Professor 20
## 6 s06 Katy 1 Professor 25
head(links)
## from to type weight
## 1 s01 s02 virtual 22
## 2 s01 s03 virtual 22
## 3 s01 s04 virtual 21
## 4 s01 s15 virtual 20
## 5 s02 s01 virtual 23
## 6 s02 s03 virtual 21
Next we will convert the raw data to an igraph network object, i.e., a format that igraph's functions can work with.
To do that, we will use the graph_from_data_frame() function, which takes two data frames as arguments: d and vertices.
First, make sure the igraph package is installed, then load it by typing:
library("igraph")
To create the graph object type:
net <- graph_from_data_frame(d=links, vertices=nodes, directed=T)
net
## IGRAPH a0c3d0e DNW- 17 49 --
## + attr: name (v/c), level.type (v/n), type.label (v/c),
## | years.worked (v/n), type (e/c), weight (e/n)
## + edges from a0c3d0e (vertex names):
## [1] Jon ->Anna Jon ->Henriikka Jon ->Gia
## [4] Jon ->Reeses Anna ->Jon Anna ->Henriikka
## [7] Anna ->Oji Anna ->Lori Henriikka->Jon
## [10] Henriikka->Gia Henriikka->Rich Henriikka->Sunni
## [13] Henriikka->Lori Henriikka->Maya Henriikka->Nonee
## [16] Gia ->Henriikka Gia ->Katy Gia ->Maya
## [19] Gia ->Nonee Gia ->Susi Rich ->Jon
## + ... omitted several edges
The description of an igraph object starts with four letters: D or U for a directed or undirected graph; N for a named graph (nodes have a “name” attribute); W for a weighted graph (edges have a “weight” attribute); and B for a bipartite (two-mode) graph (nodes have a “type” attribute).
The two numbers that follow (17, 49) refer to the number of nodes and edges in the graph. The description also lists node & edge attributes, for example: (v/c) is a vertex-level character attribute, (v/n) a vertex-level numeric attribute, (e/c) an edge-level character attribute, and (e/n) an edge-level numeric attribute.
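A quick sketch decoding that header on a tiny toy graph (the vertex names here are made up for illustration):

```r
library(igraph)

# A small directed graph with named vertices and edge weights
g <- graph_from_literal(A -+ B, B -+ C)   # "-+" denotes a directed edge
E(g)$weight <- c(1, 2)

g
# The summary line reads "IGRAPH <id> DNW- 3 2 --":
#   D = directed, N = named (a "name" vertex attribute),
#   W = weighted (a "weight" edge attribute),
#   - = not bipartite; followed by 3 vertices and 2 edges
```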
Let’s describe the type of graph we have here.
We can also access nodes, edges, and their attributes by typing:
E(net) # The edges of the "net" object
## + 49/49 edges from a0c3d0e (vertex names):
## [1] Jon ->Anna Jon ->Henriikka Jon ->Gia
## [4] Jon ->Reeses Anna ->Jon Anna ->Henriikka
## [7] Anna ->Oji Anna ->Lori Henriikka->Jon
## [10] Henriikka->Gia Henriikka->Rich Henriikka->Sunni
## [13] Henriikka->Lori Henriikka->Maya Henriikka->Nonee
## [16] Gia ->Henriikka Gia ->Katy Gia ->Maya
## [19] Gia ->Nonee Gia ->Susi Rich ->Jon
## [22] Rich ->Anna Rich ->Oji Rich ->Reeses
## [25] Katy ->Katy Katy ->Oreo Katy ->Susi
## [28] Kate ->Henriikka Kate ->Sunni Kate ->Lori
## + ... omitted several edges
V(net) # The vertices of the "net" object
## + 17/17 vertices, named, from a0c3d0e:
## [1] Jon Anna Henriikka Gia Rich Katy Kate
## [8] Sunni Oji Lori Maya Nonee Gigi Rod
## [15] Reeses Oreo Susi
E(net)$type # Edge attribute "type"
## [1] "virtual" "virtual" "virtual" "virtual" "virtual" "virtual" "virtual"
## [8] "virtual" "virtual" "virtual" "virtual" "virtual" "person" "virtual"
## [15] "virtual" "virtual" "person" "person" "virtual" "person" "person"
## [22] "virtual" "virtual" "person" "virtual" "virtual" "person" "person"
## [29] "person" "virtual" "person" "virtual" "person" "person" "person"
## [36] "virtual" "person" "virtual" "person" "virtual" "person" "person"
## [43] "person" "virtual" "virtual" "virtual" "virtual" "person" "virtual"
V(net)$name # Vertex attribute "name"
## [1] "Jon" "Anna" "Henriikka" "Gia" "Rich"
## [6] "Katy" "Kate" "Sunni" "Oji" "Lori"
## [11] "Maya" "Nonee" "Gigi" "Rod" "Reeses"
## [16] "Oreo" "Susi"
# Find nodes and edges by attribute:
# (that returns objects of type vertex sequence/edge sequence)
V(net)[name=="Anna"]
## + 1/17 vertex, named, from a0c3d0e:
## [1] Anna
E(net)[type=="virtual"]
## + 30/49 edges from a0c3d0e (vertex names):
## [1] Jon ->Anna Jon ->Henriikka Jon ->Gia
## [4] Jon ->Reeses Anna ->Jon Anna ->Henriikka
## [7] Anna ->Oji Anna ->Lori Henriikka->Jon
## [10] Henriikka->Gia Henriikka->Rich Henriikka->Sunni
## [13] Henriikka->Maya Henriikka->Nonee Gia ->Henriikka
## [16] Gia ->Nonee Rich ->Anna Rich ->Oji
## [19] Katy ->Katy Katy ->Oreo Kate ->Lori
## [22] Sunni ->Henriikka Lori ->Henriikka Nonee ->Gigi
## [25] Gigi ->Nonee Reeses ->Jon Reeses ->Gia
## [28] Reeses ->Katy Oreo ->Katy Susi ->Gia
# You can also examine the network matrix directly:
net[,]
## 17 x 17 sparse Matrix of class "dgCMatrix"
## [[ suppressing 17 column names 'Jon', 'Anna', 'Henriikka' ... ]]
##
## Jon . 22 22 21 . . . . . . . . . . 20 . .
## Anna 23 . 21 . . . . . 1 5 . . . . . . .
## Henriikka 21 . . 22 1 . . 4 . 2 1 1 . . . . .
## Gia . . 23 . . 1 . . . . 22 3 . . . . 2
## Rich 1 21 . . . . . . 2 . . . . . 21 . .
## Katy . . . . . 1 . . . . . . . . . 21 21
## Kate . . 1 . . . . 22 . 21 . . . 4 . . .
## Sunni . . 2 . . . 21 . 23 . . . . . . . .
## Oji . . . . . . . . . 21 . . . . . . .
## Lori . . 2 . . . . . . . . . . . . . .
## Maya . . . . . . . . . . . . . . . . .
## Nonee . . . . . 2 . . . . . . 22 22 . . .
## Gigi . . . . . . . . . . . 21 . . . . 1
## Rod . . . . . . . . . . 1 . 21 . . . .
## Reeses 22 . . 1 . 4 . . . . . . . . . . .
## Oreo . . . . . 23 . . . . . . . . . . 21
## Susi . . . 4 . . . . . . . . . . . . .
net[1,]
## Jon Anna Henriikka Gia Rich Katy Kate
## 0 22 22 21 0 0 0
## Sunni Oji Lori Maya Nonee Gigi Rod
## 0 0 0 0 0 0 0
## Reeses Oreo Susi
## 20 0 0
net[5,7]
## [1] 0
Question: What does the code E(net)[type=="virtual"] tell us?
Question: How would we access the cell in row 6 column 9?
Question: What about row 40, column 3?
It is also easy to extract an edge list or matrix back from the igraph network:
# Get an edge list or a matrix:
as_edgelist(net, names=T)
## [,1] [,2]
## [1,] "Jon" "Anna"
## [2,] "Jon" "Henriikka"
## [3,] "Jon" "Gia"
## [4,] "Jon" "Reeses"
## [5,] "Anna" "Jon"
## [6,] "Anna" "Henriikka"
## [7,] "Anna" "Oji"
## [8,] "Anna" "Lori"
## [9,] "Henriikka" "Jon"
## [10,] "Henriikka" "Gia"
## [11,] "Henriikka" "Rich"
## [12,] "Henriikka" "Sunni"
## [13,] "Henriikka" "Lori"
## [14,] "Henriikka" "Maya"
## [15,] "Henriikka" "Nonee"
## [16,] "Gia" "Henriikka"
## [17,] "Gia" "Katy"
## [18,] "Gia" "Maya"
## [19,] "Gia" "Nonee"
## [20,] "Gia" "Susi"
## [21,] "Rich" "Jon"
## [22,] "Rich" "Anna"
## [23,] "Rich" "Oji"
## [24,] "Rich" "Reeses"
## [25,] "Katy" "Katy"
## [26,] "Katy" "Oreo"
## [27,] "Katy" "Susi"
## [28,] "Kate" "Henriikka"
## [29,] "Kate" "Sunni"
## [30,] "Kate" "Lori"
## [31,] "Kate" "Rod"
## [32,] "Sunni" "Henriikka"
## [33,] "Sunni" "Kate"
## [34,] "Sunni" "Oji"
## [35,] "Oji" "Lori"
## [36,] "Lori" "Henriikka"
## [37,] "Nonee" "Katy"
## [38,] "Nonee" "Gigi"
## [39,] "Nonee" "Rod"
## [40,] "Gigi" "Nonee"
## [41,] "Gigi" "Susi"
## [42,] "Rod" "Maya"
## [43,] "Rod" "Gigi"
## [44,] "Reeses" "Jon"
## [45,] "Reeses" "Gia"
## [46,] "Reeses" "Katy"
## [47,] "Oreo" "Katy"
## [48,] "Oreo" "Susi"
## [49,] "Susi" "Gia"
as_adjacency_matrix(net, attr="weight")
## 17 x 17 sparse Matrix of class "dgCMatrix"
## [[ suppressing 17 column names 'Jon', 'Anna', 'Henriikka' ... ]]
##
## Jon . 22 22 21 . . . . . . . . . . 20 . .
## Anna 23 . 21 . . . . . 1 5 . . . . . . .
## Henriikka 21 . . 22 1 . . 4 . 2 1 1 . . . . .
## Gia . . 23 . . 1 . . . . 22 3 . . . . 2
## Rich 1 21 . . . . . . 2 . . . . . 21 . .
## Katy . . . . . 1 . . . . . . . . . 21 21
## Kate . . 1 . . . . 22 . 21 . . . 4 . . .
## Sunni . . 2 . . . 21 . 23 . . . . . . . .
## Oji . . . . . . . . . 21 . . . . . . .
## Lori . . 2 . . . . . . . . . . . . . .
## Maya . . . . . . . . . . . . . . . . .
## Nonee . . . . . 2 . . . . . . 22 22 . . .
## Gigi . . . . . . . . . . . 21 . . . . 1
## Rod . . . . . . . . . . 1 . 21 . . . .
## Reeses 22 . . 1 . 4 . . . . . . . . . . .
## Oreo . . . . . 23 . . . . . . . . . . 21
## Susi . . . 4 . . . . . . . . . . . . .
# Or data frames describing nodes and edges:
as_data_frame(net, what="edges")
## from to type weight
## 1 Jon Anna virtual 22
## 2 Jon Henriikka virtual 22
## 3 Jon Gia virtual 21
## 4 Jon Reeses virtual 20
## 5 Anna Jon virtual 23
## 6 Anna Henriikka virtual 21
## 7 Anna Oji virtual 1
## 8 Anna Lori virtual 5
## 9 Henriikka Jon virtual 21
## 10 Henriikka Gia virtual 22
## 11 Henriikka Rich virtual 1
## 12 Henriikka Sunni virtual 4
## 13 Henriikka Lori person 2
## 14 Henriikka Maya virtual 1
## 15 Henriikka Nonee virtual 1
## 16 Gia Henriikka virtual 23
## 17 Gia Katy person 1
## 18 Gia Maya person 22
## 19 Gia Nonee virtual 3
## 20 Gia Susi person 2
## 21 Rich Jon person 1
## 22 Rich Anna virtual 21
## 23 Rich Oji virtual 2
## 24 Rich Reeses person 21
## 25 Katy Katy virtual 1
## 26 Katy Oreo virtual 21
## 27 Katy Susi person 21
## 28 Kate Henriikka person 1
## 29 Kate Sunni person 22
## 30 Kate Lori virtual 21
## 31 Kate Rod person 4
## 32 Sunni Henriikka virtual 2
## 33 Sunni Kate person 21
## 34 Sunni Oji person 23
## 35 Oji Lori person 21
## 36 Lori Henriikka virtual 2
## 37 Nonee Katy person 2
## 38 Nonee Gigi virtual 22
## 39 Nonee Rod person 22
## 40 Gigi Nonee virtual 21
## 41 Gigi Susi person 1
## 42 Rod Maya person 1
## 43 Rod Gigi person 21
## 44 Reeses Jon virtual 22
## 45 Reeses Gia virtual 1
## 46 Reeses Katy virtual 4
## 47 Oreo Katy virtual 23
## 48 Oreo Susi person 21
## 49 Susi Gia virtual 4
as_data_frame(net, what="vertices")
## name level.type type.label years.worked
## Jon Jon 1 Professor 10
## Anna Anna 1 Professor 5
## Henriikka Henriikka 1 Professor 7
## Gia Gia 1 Professor 9
## Rich Rich 1 Professor 20
## Katy Katy 1 Professor 25
## Kate Kate 2 Instructor 40
## Sunni Sunni 2 Instructor 2
## Oji Oji 2 Instructor 4
## Lori Lori 2 Instructor 5
## Maya Maya 2 Instructor 23
## Nonee Nonee 3 Staff 15
## Gigi Gigi 3 Staff 64
## Rod Rod 3 Staff 34
## Reeses Reeses 3 Staff 23
## Oreo Oreo 3 Staff 77
## Susi Susi 3 Staff 3
plot(net) # most basic plot of the network
Question: What types of structures/properties are evident from this graph?
Notice that Katy has “nominated” herself, meaning in this instance that she communicates or collaborates with herself. This information may or may not be useful given the research question. Either way, we are going to remove duplicates and self-nominations as follows:
net <- simplify(net, remove.multiple = F, remove.loops = T)
Let’s reduce the arrow size and remove the labels (we do that by setting them to NA):
plot(net, edge.arrow.size=.4,vertex.label=NA)
You may have noticed that you can tweak the parameters for clarity, consistency, etc.
Exercise: Let’s change the edges of the above network graph by making them blue.
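One possible solution (a sketch; any built-in R color name, or a hex code such as "#0000FF", will work):

```r
# Same plot as before, now with blue edges
plot(net, edge.arrow.size=.4, vertex.label=NA, edge.color="blue")
```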
Below, I plot the graph using curved edges (edge.curved=.1) and reduce arrow size a bit.
plot(net, edge.arrow.size=.25, edge.curved=.1)
Note that using curved edges will allow you to see multiple links between two nodes (e.g. links going in either direction, or multiplex links)
Let’s change the edge color to red, the node & border color to orange and replace the vertex label with the node names stored in variable “name” in the nodes data set.
Notice you can use the code or the color name.
plot(net, edge.arrow.size=.2, edge.color="red",
vertex.color="orange", vertex.frame.color="#ffffff",
vertex.label=V(net)$name, vertex.label.color="navy")
The second way to set attributes is to add them to the igraph object.
Let’s color our network nodes based on type of professor, and size them based on the number of years worked (more years worked -> larger node).
We will also change the width of the edges based on the variable “weight.”
In this case, I know there are three types of people in my network, I have professors, instructors and staff.
Who’s Zooming Who?
# Generate colors based on professor type:
colrs <- c("green", "yellow", "red")
#This assigns green to professor, yellow to instructor and red to staff
V(net)$color <- colrs[V(net)$level.type] #Note: this variable must be numeric
#If you just wanted to visualize professors versus everyone else use:
#V(net)$color <- ifelse(nodes[V(net), 4] == "Professor", "blue", "red")
# We could also use the years worked value:
V(net)$size <- (V(net)$years.worked*10)/18
# Set the labels to the node names stored in the "name" attribute.
# (Setting them to NA instead would render no labels.)
V(net)$label <- V(net)$name
# Set edge width based on weight:
E(net)$width <- E(net)$weight/7
#change arrow size and edge color:
E(net)$arrow.size <- .2
E(net)$color <- "gray80"
# We can even set the network layout:
#graph_attr(net, "layout") <- layout_with_lgl
graph_attr(net, "layout") <- layout_with_fr #keeps related vertices "close"
plot(net)
plot(net)
legend(x=-1.5, y=-1.1, c("Professor","Instructor", "Staff"), pch=21,
col="#777777", pt.bg=colrs, pt.cex=2, cex=.8, bty="n", ncol=3)
Question: What do we learn about UCCS employees from this graph?
plot(net, edge.color="orange", vertex.color="gray50")
plot(net, vertex.shape="none", vertex.label=V(net)$name,
vertex.label.font=2, vertex.label.color="gray40",
vertex.label.cex=.7, edge.color="gray85")
hist(links$weight)
mean(links$weight)
## [1] 12.40816
sd(links$weight)
## [1] 9.905635
Density is defined as the number of ties present divided by the total number of possible ties.
It addresses the question: How connected is this network? (Density tends to be higher where the costs of maintaining a tie are low, roughly > .2.)
Most social networks are not uniformly connected; rather, they contain pockets of high-density clusters that can be identified.
Here, a higher-density network is a more connected network.
graph.density(net,loop=FALSE)
## [1] 0.1764706
The average path length is the average number of steps along the shortest paths between all pairs of nodes in the network.
Average path length gives us a sense of how efficiently information flows through the network, and it is correlated with density (less dense networks require more hops, on average, to reach any one person/node).
To calculate the average number of hops to get to any node:
mean_distance(net)
## [1] 2.742188
Intuitively, this means I can get to anyone in the network in an average of 2.74 steps.
Note: don’t look at any one measure in isolation; rather, look at the measures collectively to tell the story. For example, we may think that decreasing the mean path length would provide a more efficient information flow, but to gauge efficiency we would want to compare the average path length with the network density coefficient. This is because if the density remains constant while the average path length decreases, the network has a small number of highly connected nodes; it doesn’t necessarily mean connections are happening in other parts of the network.
Assume that I am friends with Anna, and Anna has a third friend.
If the network is highly transitive, then since I am friends with Anna, I am likely also friends with that third friend.
Transitivity can give an idea about potential underlying mechanisms in the network. Gangs, for example, have high transitivity. There may be a social mechanism enforcing the transitivity we observe – one question may be whether the clusters have a unique impact on an outcome variable.
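Two toy graphs (not our course data) make the extremes concrete: in a triangle every two-path closes, while in a star graph none do:

```r
library(igraph)

tri <- make_full_graph(3)                  # a triangle: everyone knows everyone
transitivity(tri)                          # 1 -- every friend-of-a-friend tie is closed

star <- make_star(5, mode = "undirected")  # one hub tied to four leaves
transitivity(star)                         # 0 -- the hub's friends never know each other
```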
Closely related to transitivity is clustering. The clustering coefficient is defined as the proportion of closed triads over all triads in the network (open and closed).
Clustering measures the extent to which a group of nodes are more connected to each other than to others.
A highly clustered network may be indicative of bottlenecks, where “stuff” is kept within a cluster’s members or away from other clusters. Clusters can also indicate “group think.”
Community detection methods can identify clusters in a network.
transitivity(net)
## [1] 0.372549
Community detection (to identify clustering) is used to mark densely connected nodes. Within the group there are dense connections and between the groups there are sparser connections.
# Community detection
cnet <- cluster_edge_betweenness(net)
## Warning in cluster_edge_betweenness(net): At community.c:460 :Membership
## vector will be selected based on the lowest modularity score.
## Warning in cluster_edge_betweenness(net): At community.c:467 :Modularity
## calculation with weighted edge betweenness community detection might not
## make sense -- modularity treats edge weights as similarities while edge
## betwenness treats them as distances
# Community detection returns an object of class "communities" which can be plotted:
class(cnet)
## [1] "communities"
plot(cnet,
net,
vertex.size = 10,
vertex.label.cex = 0.8)
You can look at the actual values being clustered by typing this:
edge_betweenness(net, directed=T, weights=NA)
## [1] 10.833333 11.333333 8.333333 9.500000 4.000000 12.500000 3.000000
## [8] 2.333333 24.000000 16.000000 31.500000 32.500000 9.500000 6.500000
## [15] 23.000000 65.333333 11.000000 6.500000 18.000000 8.666667 5.333333
## [22] 10.000000 6.000000 11.166667 15.000000 21.333333 10.000000 2.000000
## [29] 1.333333 4.500000 11.833333 16.833333 6.833333 16.833333 31.000000
## [36] 17.000000 18.000000 14.500000 7.500000 28.500000 3.000000 17.000000
## [43] 5.666667 9.666667 6.333333 1.000000 15.000000 74.500000
In any given case, another visualization may work better. Below we create a hierarchical dendrogram of scores to show the relationships in a different way:
dendPlot(cnet, mode="hclust")
Each “box” represents the individuals in a densely connected group. (Sorry, Anna!)
#ratio of number of edges to number of possibilities
edge_density(net, loops = F)
## [1] 0.1764706
ecount(net)/(vcount(net)*(vcount(net)-1))
## [1] 0.1764706
Positional features tell us who the important players in the network are.
A node’s degree centrality refers to the number of ties it has, i.e., the number of connections to and from other nodes.
The metrics associated with degree address the question: Are highly (or minimally) connected people unique in some way, or “at risk” in some way?
In a directed network there are two types of degree: in-degree (ties received) and out-degree (ties sent).
degree(net, mode='all') #number of connections (note: in + out = all)
## Jon Anna Henriikka Gia Rich Katy Kate
## 8 6 13 9 5 6 5
## Sunni Oji Lori Maya Nonee Gigi Rod
## 5 4 5 3 6 4 4
## Reeses Oreo Susi
## 5 3 5
degree(net, mode='in') #how many arrows are coming into a vertex (e.g. Jon)
## Jon Anna Henriikka Gia Rich Katy Kate
## 4 2 6 4 1 4 1
## Sunni Oji Lori Maya Nonee Gigi Rod
## 2 3 4 3 3 2 2
## Reeses Oreo Susi
## 2 1 4
degree(net, mode='out')
## Jon Anna Henriikka Gia Rich Katy Kate
## 4 4 7 5 4 2 4
## Sunni Oji Lori Maya Nonee Gigi Rod
## 3 1 1 0 3 2 2
## Reeses Oreo Susi
## 3 2 1
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.1
# Plot the degree distribution for our network
deg.dist <- degree_distribution(net, cumulative=F, mode="all")
netw_degree <- as.data.frame(deg.dist)
qplot(deg.dist, data=netw_degree, geom="histogram", binwidth=.05,
xlab="Degree", ylab="Probability Density")
deg.dist <- degree_distribution(net, cumulative=T, mode="all")
plot(x=0:max(degree(net)), y=1-deg.dist, pch=19, cex=1.2, col="orange",
xlab="Degree", ylab="Cumulative Frequency")
Question: What is the difference between these plots?
Distribution of node degrees
# Histogram of node degree
V(net)$label <- V(net)$name
V(net)$degree <- degree(net)
hist(V(net)$degree,
col = 'green',
main = 'Histogram of Node Degree',
ylab = 'Frequency',
xlab = 'Degree of Vertices')
Closeness centrality measures how quickly an entity can access more entities in a network. An entity or person with a high closeness centrality generally has quick access to other entities in a network.
Betweenness centrality identifies an entity’s position within a network in terms of its ability to make connections to other pairs or groups in a network. An entity or person with high betweenness centrality has a greater amount of influence over what happens in a network.
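Closeness has a simple arithmetic interpretation: it is the reciprocal of the sum of a node's shortest-path distances to every other node. A quick sketch verifying this for one node, assuming the `net` object built above:

```r
# Shortest-path distances from Jon to every node (unweighted, so weights = NA)
d <- distances(net, v = "Jon", mode = "all", weights = NA)

# Closeness is the reciprocal of the total distance
1 / sum(d)  # should match closeness(net, mode = "all", weights = NA)["Jon"]
```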
reciprocity(net) # are people sending out ties also getting them?
## [1] 0.4166667
closeness(net, mode="all", weights = NA) # how close any one node is to anyone else
## Jon Anna Henriikka Gia Rich Katy
## 0.03333333 0.03030303 0.04166667 0.03846154 0.03225806 0.03125000
## Kate Sunni Oji Lori Maya Nonee
## 0.03030303 0.02857143 0.02564103 0.02941176 0.03225806 0.03571429
## Gigi Rod Reeses Oreo Susi
## 0.02702703 0.02941176 0.03030303 0.02222222 0.02857143
#can change mode to "in" (how close are you to people sending you ties) or "out"
betweenness(net, directed = T, weights = NA)
## Jon Anna Henriikka Gia Rich Katy
## 24.0000000 5.8333333 127.0000000 93.5000000 16.5000000 20.3333333
## Kate Sunni Oji Lori Maya Nonee
## 1.8333333 19.5000000 0.8333333 15.0000000 0.0000000 33.5000000
## Gigi Rod Reeses Oreo Susi
## 20.0000000 4.0000000 5.6666667 0.0000000 58.5000000
Question: Who has quicker access to other colleagues in the network?
Question: Who has relatively more influence over their colleagues?
Question: Is there a high degree of mutual relationships in this network?
Here we have a loosely connected network, not a core-periphery one, which is why the closeness scores are not highly variable. This may mean nobody can uniquely diffuse information in this network, for example.
We can plot these measures onto our graph as attributes as below.
#Layout Options
set.seed(3952) # set seed to make the layout reproducible
layout1 <- layout.fruchterman.reingold(net,niter=500)
#Node or Vertex Options: Size and Color
V(net)$size=betweenness(net)/5
#because we have a wide range, I am dividing by 5 to keep the high-degree nodes from overshadowing everything else.
V(net)$color <- ifelse(nodes[V(net), 4] == "Professor", "green", "yellow")
#Edge Options: Color
E(net)$color <- "grey"
#Plotting, Now Specifying an arrow size and getting rid of arrow heads
#We are letting the color and the size of the node indicate the directed nature of the graph
plot(net, edge.arrow.size=.45)
Clearly Henriikka has the highest betweenness centrality here.
Here we keep links that have weight higher than the mean for the network. In igraph, we can delete edges using delete_edges(net, edges):
cut.off <- mean(links$weight)
net.sp <- delete_edges(net, E(net)[weight<cut.off])
plot(net.sp, vertex.label=V(net)$name, layout=layout_with_kk)
Sometimes we want to focus the visualization on a particular node or a group of nodes. In this example, we can examine the spread of information from focal actors. For instance, let’s trace the shortest path from Anna to Gia.
The shortest_paths() function returns the nodes and edges along the shortest path between the node given in the from parameter and the one given in the to parameter.
prof.path <- shortest_paths(net,
from = V(net)[name=="Anna"],
to = V(net)[name=="Gia"],
output = "both") # both path nodes and edges
# Generate edge color variable to plot the path
ecol <- rep("gray80", ecount(net))
ecol[unlist(prof.path$epath)] <- "orange"
# Generate edge width variable to plot the path
ew <- rep(2, ecount(net))
ew[unlist(prof.path$epath)] <- 4
# Generate node color variable to plot the path:
vcol <- rep("gray40", vcount(net))
vcol[unlist(prof.path$vpath)] <- "gold"
plot(net, vertex.color=vcol, edge.color=ecol, vertex.label=V(net)$name,
edge.width=ew, edge.arrow.mode=0)
Make the edges a different color depending on the type of connection, virtual or in person.
E(net)$width <- 1.5
plot(net, edge.color=c("dark red", "slategrey")[(E(net)$type=="virtual")+1],
vertex.color="gray40", layout=layout.fruchterman.reingold, edge.curved=.3)
net.m <- net - E(net)[E(net)$type=="virtual"] # another way to delete edges
net.h <- net - E(net)[E(net)$type=="person"] # using the minus operator
# Plot the two links separately:
par(mfrow=c(1,2))
plot(net.h, vertex.color="orange", layout=layout_with_fr, main="Virtual Connections")
plot(net.m, vertex.color="lightsteelblue2", layout=layout_with_fr, main="Person Connections")
library(ggraph)
## Warning: package 'ggraph' was built under R version 3.6.1
library(igraph)
ggraph(net) +
geom_edge_link() + # add edges to the plot
geom_node_point() # add nodes to the plot
## Using `stress` as default layout
ggraph(net, layout="fr") +
geom_edge_link() +
ggtitle("Look ma, no nodes!") # add title to the plot
ggraph(net, layout="fr") +
geom_edge_fan(color="gray50", width=0.8) +
geom_node_point(color=V(net)$color, size=8) +
theme_void()
ggraph(net, layout = 'linear') +
geom_edge_arc(color = "orange", width=0.8) +
geom_node_point(size=5, color="gray50") +
theme_void()
ggraph(net, layout="lgl") +
geom_edge_link(aes(color = type)) + # colors by edge type
geom_node_point(aes(size = years.worked)) + # size by years worked
theme_void()
ggraph(net, layout = 'lgl') +
geom_edge_arc(color="gray", curvature=0.3) +
geom_node_point(color="orange", aes(size = years.worked)) +
geom_node_text(aes(label = name), size=3, color="gray50", repel=T) +
theme_void()
## Warning: The curvature argument has been deprecated in favour of strength
As you might remember, our second example is a two-mode (bipartite) network examining links between UCCS employees and students.
nodes2 <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset2-UCCS-User-Example-NODES.csv", header=T, as.is=T)
links2 <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset2-UCCS-User-Example-EDGES.csv", header=T, row.names=1)
head(nodes2)
## id name level.type type.label years.worked
## 1 s01 Jon 1 Professor 10
## 2 s02 Anna 1 Professor 5
## 3 s03 Henriikka 1 Professor 7
## 4 s04 Gia 1 Professor 9
## 5 s05 Rich 1 Professor 20
## 6 s06 Katy 1 Professor 25
head(links2)
## U01 U02 U03 U04 U05 U06 U07 U08 U09 U10 U11 U12 U13 U14 U15 U16 U17
## s01 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## s02 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## s03 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
## s04 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
## s05 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
## s06 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1
## U18 U19 U20
## s01 0 0 0
## s02 0 0 1
## s03 0 0 0
## s04 0 0 0
## s05 0 0 0
## s06 0 0 0
links2 <- as.matrix(links2)
dim(links2)
## [1] 10 20
dim(nodes2)
## [1] 30 5
net2 <- graph_from_incidence_matrix(links2)
table(V(net2)$type)
##
## FALSE TRUE
## 10 20
net2
## IGRAPH aa576ef UN-B 30 31 --
## + attr: type (v/l), name (v/c)
## + edges from aa576ef (vertex names):
## [1] s01--U01 s01--U02 s01--U03 s02--U04 s02--U05 s02--U20 s03--U06
## [8] s03--U07 s03--U08 s03--U09 s04--U09 s04--U10 s04--U11 s05--U11
## [15] s05--U12 s05--U13 s06--U13 s06--U14 s06--U17 s07--U14 s07--U15
## [22] s07--U16 s08--U16 s08--U17 s08--U18 s08--U19 s09--U06 s09--U19
## [29] s09--U20 s10--U01 s10--U11
plot(net2, vertex.label=V(net2)$name)
As with one-mode networks, we can modify the network object to include the visual properties that will be used by default when plotting the network. Notice that this time we will also change the shape of the nodes - UCCS employees will be squares, and students will be circles.
# Employees are blue squares, student nodes are orange circles:
V(net2)$color <- c("steel blue", "orange")[V(net2)$type+1]
V(net2)$shape <- c("square", "circle")[V(net2)$type+1]
#Employees will have name labels, students will not:
V(net2)$label <- ""
V(net2)$label[V(net2)$type==F] <- nodes2$name[V(net2)$type==F]
V(net2)$label.cex=.6
V(net2)$label.font=2
plot(net2, vertex.label.color="white", vertex.size=(2-V(net2)$type)*8)
plot(net2, vertex.shape="none", vertex.label=nodes2$name,
vertex.label.color=V(net2)$color, vertex.label.font=2,
vertex.label.cex=.6, edge.color="gray70", edge.width=2)
Below I include the use of images as nodes. In order to do this, you will need the png package (if missing, install it with install.packages("png")).
# install.packages('png')
library('png')
img.1 <- readPNG("D:/Barboza Law/COC presentation/Day 3/computer.png")
img.2 <- readPNG("D:/Barboza Law/COC presentation/Day 3/User-Administrator-Green-icon.png")
V(net2)$raster <- list(img.1, img.2)[V(net2)$type+1]
plot(net2, vertex.shape="raster", vertex.label=NA,
vertex.size=16, vertex.size2=16, edge.width=2)
netm <- get.adjacency(net, attr="weight", sparse=F)
colnames(netm) <- V(net)$name
rownames(netm) <- V(net)$name
palf <- colorRampPalette(c("gold", "dark orange"))
heatmap(netm[,17:1], Rowv = NA, Colv = NA, col = palf(100),
scale="none", margins=c(10,10) )
library('maps')
library('geosphere')
## Warning: package 'geosphere' was built under R version 3.6.1
par(mfrow = c(2,2), mar=c(0,0,0,0))
map("usa", col="tomato", border="gray10", fill=TRUE, bg="gray30")
map("state", col="orange", border="gray10", fill=TRUE, bg="gray30")
map("county", col="palegreen", border="gray10", fill=TRUE, bg="gray30")
map("world", col="skyblue", border="gray10", fill=TRUE, bg="gray30")
airports <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset3-Airlines-NODES.csv", header=TRUE)
flights <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Dataset3-Airlines-EDGES.csv", header=TRUE, as.is=TRUE)
head(flights)
## Source Target Freq
## 1 0 109 10
## 2 1 36 10
## 3 1 61 10
## 4 2 152 10
## 5 3 104 10
## 6 4 132 10
head(airports)
## ID Label Code City latitude longitude
## 1 0 Adams Field Airport LIT Little Rock, AR 34.72944 -92.22444
## 2 1 Akron/canton Regional CAK Akron/Canton, OH 40.91611 -81.44222
## 3 2 Albany International ALB Albany 42.73333 -73.80000
## 4 3 Albemarle CHO Charlottesville 38.13333 -78.45000
## 5 4 Albuquerque International ABQ Albuquerque 35.04028 -106.60917
## 6 5 Alexandria International AEX Alexandria, LA 31.32750 -92.54861
## ToFly Visits
## 1 0 105
## 2 0 123
## 3 0 129
## 4 1 114
## 5 0 105
## 6 0 93
# Select only large airports: ones with more than 10 connections in the data.
tab <- table(flights$Source)
big.id <- names(tab)[tab>10]
airports <- airports[airports$ID %in% big.id,]
flights <- flights[flights$Source %in% big.id &
flights$Target %in% big.id, ]
# Plot a map of the united states:
map("state", col="grey20", fill=TRUE, bg="black", lwd=0.1)
# Add a point on the map for each airport:
points(x=airports$longitude, y=airports$latitude, pch=19,
cex=airports$Visits/80, col="orange")
col.1 <- adjustcolor("orange red", alpha=0.4)
col.2 <- adjustcolor("orange", alpha=0.4)
edge.pal <- colorRampPalette(c(col.1, col.2), alpha = TRUE)
edge.col <- edge.pal(100)
for(i in 1:nrow(flights)) {
node1 <- airports[airports$ID == flights[i,]$Source,]
node2 <- airports[airports$ID == flights[i,]$Target,]
arc <- gcIntermediate( c(node1[1,]$longitude, node1[1,]$latitude),
c(node2[1,]$longitude, node2[1,]$latitude),
n=1000, addStartEnd=TRUE )
edge.ind <- round(100*flights[i,]$Freq / max(flights$Freq))
lines(arc, col=edge.col[edge.ind], lwd=edge.ind/30)
}
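The loop above maps each flight's frequency onto the 100-colour ramp via `round(100*Freq/max(Freq))`. The index arithmetic can be checked in isolation (a sketch with hypothetical frequencies); note that a very small frequency could round down to an index of 0, which would select no colour, so a guard is prudent:

```r
freq <- c(10, 25, 50, 100)            # hypothetical flight frequencies
idx <- round(100 * freq / max(freq))  # scale onto the 1..100 colour ramp
idx                                   # 10 25 50 100
idx <- pmax(idx, 1)                   # guard against an index of 0
```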
To understand how the network helps inform the research question, there are two broad approaches; both leverage connection and position.
library(sna)
## Warning: package 'sna' was built under R version 3.6.3
## Loading required package: statnet.common
## Warning: package 'statnet.common' was built under R version 3.6.3
##
## Attaching package: 'statnet.common'
## The following object is masked from 'package:base':
##
## order
## Loading required package: network
## Warning: package 'network' was built under R version 3.6.3
## network: Classes for Relational Data
## Version 1.16.0 created on 2019-11-30.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## Mark S. Handcock, University of California -- Los Angeles
## David R. Hunter, Penn State University
## Martina Morris, University of Washington
## Skye Bender-deMoll, University of Washington
## For citation information, type citation("network").
## Type help("network-package") to get started.
##
## Attaching package: 'network'
## The following objects are masked from 'package:igraph':
##
## %c%, %s%, add.edges, add.vertices, delete.edges,
## delete.vertices, get.edge.attribute, get.edges,
## get.vertex.attribute, is.bipartite, is.directed,
## list.edge.attributes, list.vertex.attributes,
## set.edge.attribute, set.vertex.attribute
## sna: Tools for Social Network Analysis
## Version 2.5 created on 2019-12-09.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## For citation information, type citation("sna").
## Type help(package="sna") to get started.
##
## Attaching package: 'sna'
## The following objects are masked from 'package:igraph':
##
## betweenness, bonpow, closeness, components, degree,
## dyad.census, evcent, hierarchy, is.connected, neighborhood,
## triad.census
library(network)
# The emon dataset - interorganizational networks
#=============================================================#
# We'll use the emon dataset: interorganizational Search and Rescue
# Networks (Drabek et al.), included in the "network" package.
# The dataset contains 7 networks, each node has 8 attributes
#?emon
data(emon)
emon # a list of 7 networks
## $Cheyenne
## Network attributes:
## vertices = 14
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## total edges= 83
## missing edges= 0
## non-missing edges= 83
##
## Vertex attribute names:
## Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff
##
## Edge attribute names:
## Frequency
##
## $HurrFrederic
## Network attributes:
## vertices = 21
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## total edges= 118
## missing edges= 0
## non-missing edges= 118
##
## Vertex attribute names:
## Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff
##
## Edge attribute names:
## Frequency
##
## $LakePomona
## Network attributes:
## vertices = 20
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## total edges= 148
## missing edges= 0
## non-missing edges= 148
##
## Vertex attribute names:
## Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff
##
## Edge attribute names:
## Frequency
##
## $MtSi
## Network attributes:
## vertices = 13
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## total edges= 33
## missing edges= 0
## non-missing edges= 33
##
## Vertex attribute names:
## Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff
##
## Edge attribute names:
## Frequency
##
## $MtStHelens
## Network attributes:
## vertices = 27
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## total edges= 123
## missing edges= 0
## non-missing edges= 123
##
## Vertex attribute names:
## Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff
##
## Edge attribute names:
## Frequency
##
## $Texas
## Network attributes:
## vertices = 25
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## total edges= 186
## missing edges= 0
## non-missing edges= 186
##
## Vertex attribute names:
## Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff
##
## Edge attribute names:
## Frequency
##
## $Wichita
## Network attributes:
## vertices = 20
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## total edges= 149
## missing edges= 0
## non-missing edges= 149
##
## Vertex attribute names:
## Command.Rank.Score Decision.Rank.Score Formalization Location Paid.Staff Sponsorship vertex.names Volunteer.Staff
##
## Edge attribute names:
## Frequency
# Cheyenne network (the first one in the list)
ch.net <- emon$Cheyenne
plot(ch.net)
# Extract the node attributes of the Cheyenne network into a data frame
ch.attr <- data.frame(matrix(0, 14, 8))
colnames(ch.attr) <- c("vertex.names", "Command.Rank.Score", "Decision.Rank.Score", "Formalization", "Location", "Paid.Staff", "Sponsorship", "Volunteer.Staff")
# Copy each of the 8 vertex attributes to a variable in the ch.attr data frame
for (i in 1:8) { ch.attr[,i] <- (ch.net %v% colnames(ch.attr)[i]) }
ch.attr
## vertex.names Command.Rank.Score
## 1 Wyoming.Disaster.and.Civil.Defense.Agnecy 0
## 2 Wyoming.State.National.Guard..Army.and.Air. 10
## 3 Wyoming.State.Highway.Patrol 3
## 4 Francis.E..Warren.Air.Force.Base..fire.unit. 5
## 5 Red.Cross 0
## 6 Salvation.Army 0
## 7 Laramie.County.Sheriff.s.Office 20
## 8 Laramie.County...Cheyenne.Civil.Defense.Agency 40
## 9 Cheyenne.Fire.Department 10
## 10 Cheyenne.Police.Department 30
## 11 Laramie.County.Fire.District..1..volunteer. 2
## 12 Laramie.County.Fire.District..2..volunteer. 2
## 13 A.1.Ambulance.Service 2
## 14 Shy.Wy.HAM.Radio.Club 0
## Decision.Rank.Score Formalization Location Paid.Staff Sponsorship
## 1 20 2 L 10 State
## 2 7 1 L 400 State
## 3 0 1 L 200 State
## 4 5 1 L 60 Federal
## 5 0 1 L 1 Private
## 6 0 1 L 7 Private
## 7 20 1 L 60 County
## 8 50 2 L 7 County/City
## 9 10 1 L 70 City
## 10 20 3 L 100 City
## 11 3 NA L NA County
## 12 3 1 L 1 County
## 13 2 1 L 10 Private
## 14 0 1 L 0 Private
## Volunteer.Staff
## 1 50
## 2 2000
## 3 0
## 4 0
## 5 20
## 6 80
## 7 20
## 8 100
## 9 0
## 10 0
## 11 NA
## 12 20
## 13 6
## 14 NA
# Correlation & Linear Regression in R
#=============================================================#
# Check the correlation between command rank and decision rank scores:
# (remember that ch.attr[[2]] is the same as ch.attr$Command.Rank.Score, etc.)
# Calculate the correlation btw command and decision rank
cor(ch.attr$Command.Rank.Score, ch.attr$Decision.Rank.Score)
## [1] 0.8718257
cor.test(ch.attr$Command.Rank.Score, ch.attr$Decision.Rank.Score) # Examine the significance
##
## Pearson's product-moment correlation
##
## data: ch.attr$Command.Rank.Score and ch.attr$Decision.Rank.Score
## t = 6.1658, df = 12, p-value = 4.83e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6349626 0.9588618
## sample estimates:
## cor
## 0.8718257
# Linear regression: lm(DV ~ IV1 + IV2 + IV3 + ...)
# DV = dependent variable, IV = independent variables
# Calculate centrality measures and store them in the ch.attr data frame:
ch.attr$IndegCent <- degree(ch.net, gmode="digraph", cmode="indegree") # indegree centralities
ch.attr$OutdegCent <- degree(ch.net, gmode="digraph", cmode="outdegree") # outdegree centralities
ch.attr$BetweenCent <- betweenness(ch.net, gmode="digraph") # betweenness centralities
# We can use the centralities in a linear regression model:
ch.lm.1 <- lm(ch.attr$Command.Rank.Score ~ ch.attr$Paid.Staff + ch.attr$BetweenCent) # Model with betweenness centrality
summary(ch.lm.1)
##
## Call:
## lm(formula = ch.attr$Command.Rank.Score ~ ch.attr$Paid.Staff +
## ch.attr$BetweenCent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.271 -5.038 -2.433 1.739 26.225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.34254 4.68527 0.500 0.6279
## ch.attr$Paid.Staff 0.01285 0.02886 0.445 0.6657
## ch.attr$BetweenCent 0.88864 0.38665 2.298 0.0444 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.39 on 10 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3498, Adjusted R-squared: 0.2198
## F-statistic: 2.69 on 2 and 10 DF, p-value: 0.1162
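The `lm(DV ~ IV1 + IV2)` pattern works the same way on any data frame. A self-contained sketch with toy data (the variable names here are made up for illustration): the fitted slope on `x1` should recover the true value of 2.

```r
set.seed(42)
toy <- data.frame(x1 = 1:20, x2 = rnorm(20))
toy$y <- 3 + 2 * toy$x1 + rnorm(20, sd = 0.1)  # true slope on x1 is 2
fit <- lm(y ~ x1 + x2, data = toy)
round(unname(coef(fit)["x1"]), 1)  # approximately 2
```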
Below I use the same procedure to cluster the fake corrections data. There were only 2 numeric variables in the data set, so those are the only ones I could use. This illustrates that when one visualization doesn't work well, there is usually an alternative that fits the data relatively better.
We are clustering mhneeds_lvl and subs_lvl_values across incident codes (ldesc). Instead of visualizing the clusters, I created a dendrogram using fviz_dend.
library(tidyverse) # data manipulation
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.3 v purrr 0.3.2
## v tidyr 1.0.0 v dplyr 0.8.3
## v readr 1.3.1 v stringr 1.4.0
## v tibble 2.1.3 v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.1
## Warning: package 'tidyr' was built under R version 3.6.1
## Warning: package 'dplyr' was built under R version 3.6.1
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## x purrr::compose() masks igraph::compose()
## x tidyr::crossing() masks igraph::crossing()
## x dplyr::filter() masks stats::filter()
## x dplyr::groups() masks igraph::groups()
## x dplyr::lag() masks stats::lag()
## x purrr::map() masks maps::map()
## x purrr::simplify() masks igraph::simplify()
library(cluster) # clustering algorithms
##
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
##
## votes.repub
library(factoextra) # clustering algorithms & visualization (fvis cluster and fvis distance)
## Warning: package 'factoextra' was built under R version 3.6.1
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fake_data <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/fake_data.csv")
#Select only the columns of interest. Here I selected the variable ldesc and the two numeric variables: mhneeds_lvl and subs_lvl_values.
df <- fake_data[, c(9, 4:5)]
#This command aggregates the data by ldesc and calculates the overall means for the mhneeds and substance use needs for each ldesc
df <- aggregate(fake_data[, 4:5], list(ldesc=fake_data$ldesc), mean)
#We cannot cluster on the variable ldesc, so below we move that column into the row names to act as labels
df2 <- df[,-1]
rownames(df2) <- df[,1]
#Here we scale the data to normalize it: for each variable, the mean is subtracted from each value and the result is divided by the standard deviation
df2 <- scale(df2)
#The command head is like glimpse and lets us take a look at a few rows of the data; the default is 6 rows
head(df2)
## mhneeds_lvl subs_lvl_values
## ACT OF SABOTAGE -1.3839785 1.0960671
## ESCAPE W/ FORCE 0.9999968 1.0379304
## PROPERTY DAMAGE 0.7597913 -0.8025534
## SEXUAL ACT -0.6551899 -0.3567935
## USE OF FORCE 0.2793802 -0.9746506
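The standardization done by `scale()` above can be verified by hand on a toy vector (a sketch; `x` here is made up):

```r
# scale() standardizes each column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```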
#This calculates the distance between observations
distance <- get_dist(df2)
#This command visualizes the distance. Note: you can change these colors to anything you want.
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
final <- kmeans(df2, 3, nstart = 25)
fviz_cluster(final, data = df2)
# Visualize the dendrogram using the cluster agglomeration method "complete", which computes all pairwise dissimilarities
hc.cut <- hcut(df2, k = 3, hc_method = "complete")
fviz_dend(hc.cut, show_labels = TRUE, rect = TRUE)
Property damage and use of force are most similar, then sexual act, etc.
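Under the hood, `hcut()` wraps base R's hierarchical clustering; an equivalent sketch using only base functions (with toy data, since the fake corrections data is not assumed here):

```r
set.seed(1)
m <- matrix(rnorm(20), nrow = 5)      # 5 toy observations, 4 variables
d <- dist(m)                          # Euclidean distance matrix
hc <- hclust(d, method = "complete")  # complete-linkage clustering
clusters <- cutree(hc, k = 3)         # cut the tree into 3 groups
length(unique(clusters))  # 3
```

`plot(hc)` would draw the same kind of dendrogram that `fviz_dend()` styles.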
As we discussed last week, the data are rarely, if ever, in the format you need. At a minimum, new variables must be created, existing variables may need recoding, variable summaries will need to be computed, and often it's important to rename variables. One of the most powerful packages for reshaping data in R is called dplyr. We will cover dplyr basics below, using a different data set on crimes from the city of Los Angeles.
Let’s take a look at the data we will be using. The way these data are made available is very common, as all big, open data websites allow multiple methods of downloading the data. Additionally, if the data are in spatial format, there are several options for downloading the spatial data as well. Note, sometimes the data are spatial but are made available in a non-spatial way, such as a csv file. No need to worry because those data can be transformed and then mapped accordingly.
From the city of LA open data website, click Data Catalog. This is the data we saw last week. Notice the column called "Crm Cd Desc", which describes the type of crime. Let's look at data on domestic violence. There are two ways to filter these data.
Download the data to your computer and open the file. First, open it in Excel.
I already uploaded this data to my GitHub; let's load it into RStudio.
Let’s call this data set IPV_data.
IPV_data <- read.csv("https://raw.githubusercontent.com/elisegia/COC/master/Crime_Data_from_2020_to_Present_import.csv")
#glimpse gives a transposed overview of the data: each variable with its type and first few values
glimpse(IPV_data)
## Observations: 1,363
## Variables: 40
## $ ï..DR_NO <int> 200206684, 200104417, 200104511, 200104664, ...
## $ Date.Rptd <fct> 2/23/20 0:00, 1/6/20 0:00, 1/8/20 0:00, 1/9/...
## $ DATE.OCC <fct> 2/23/20 0:00, 1/6/20 0:00, 1/8/20 0:00, 1/9/...
## $ TIME.OCC <int> 315, 2100, 400, 1900, 2130, 2300, 800, 725, ...
## $ newdate <fct> 02/23/2020, 01/06/2020, 01/08/2020, 01/09/20...
## $ formatted_newdate <fct> 02/23/2020, 01/06/2020, 01/08/2020, 01/09/20...
## $ serialdate <int> 43884, 43836, 43838, 43839, 43839, 43843, 43...
## $ day <int> 23, 6, 8, 9, 9, 13, 1, 18, 18, 29, 27, 28, 2...
## $ month <int> 2, 1, 1, 1, 1, 1, 3, 1, 1, 6, 1, 1, 1, 2, 2,...
## $ year <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20...
## $ weekday <fct> Wed, Thu, Thu, Mon, Sun, Sat, Sat, Mon, Mon,...
## $ time_text <dbl> 0.13541667, 0.87500000, 0.16666667, 0.791666...
## $ time_serial <fct> 03:15, 21:00, 04:00, 19:00, 21:30, 23:00, 08...
## $ time <fct> 3:15 AM, 9:00 PM, 4:00 AM, 7:00 PM, 9:30 PM,...
## $ time.1 <fct> 3:15 AM, 9:00 PM, 4:00 AM, 7:00 PM, 9:30 PM,...
## $ date_time <fct> 02/23/2020 0.135416666666667, 01/06/2020 0.8...
## $ AREA <int> 2, 1, 1, 1, 1, 1, 21, 1, 1, 5, 1, 1, 1, 1, 1...
## $ AREA.NAME <fct> Rampart, Central, Central, Central, Central,...
## $ Rpt.Dist.No <int> 256, 182, 154, 156, 138, 156, 2126, 111, 118...
## $ Part.1.2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Crm.Cd <int> 236, 236, 236, 236, 236, 236, 236, 236, 236,...
## $ Crm.Cd.Desc <fct> INTIMATE PARTNER - AGGRAVATED ASSAULT, INTIM...
## $ Mocodes <fct> 2000 1813 2002 0334 0445 1601 0913, 0913 031...
## $ Vict.Age <int> 37, 26, 38, 45, 38, 46, 41, 22, 21, 42, 29, ...
## $ Vict.Sex <fct> M, F, M, F, M, F, F, F, F, F, F, M, F, M, M,...
## $ Vict.Descent <fct> H, H, B, B, B, W, O, W, H, W, B, H, H, O, B,...
## $ Premis.Cd <int> 502, 502, 502, 502, 502, 502, 501, 502, 502,...
## $ Premis.Desc <fct> "MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC...
## $ Weapon.Used.Cd <int> 200, 400, 302, 400, 216, 400, 400, 400, 400,...
## $ Weapon.Desc <fct> "KNIFE WITH BLADE 6INCHES OR LESS", "STRONG-...
## $ Status <fct> AO, IC, AA, AA, AA, IC, AO, AA, AA, IC, AO, ...
## $ Status.Desc <fct> Adult Other, Invest Cont, Adult Arrest, Adul...
## $ Crm.Cd.1 <int> 236, 236, 236, 236, 236, 236, 236, 236, 236,...
## $ Crm.Cd.2 <int> NA, NA, 998, NA, 998, NA, NA, NA, 998, NA, 9...
## $ Crm.Cd.3 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Crm.Cd.4 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ LOCATION <fct> 600 S UNION AV, 1200...
## $ Cross.Street <fct> , , , , , WALL ST, ,...
## $ LAT <dbl> 34.0565, 34.0423, 34.0474, 34.0449, 34.0503,...
## $ LON <dbl> -118.2724, -118.2666, -118.2496, -118.2458, ...
There are several data types that we want to note. Also, there is a variable called "Premis.Desc". Let's make a table of its values so we can see the types of premises associated with IPV. The default table in R is ugly, so I am going to use the kable function (along with the dplyr commands), which formats the table nicely. kable lives in the knitr library, so we load that first.
library(knitr)
IPV_data %>%
group_by(Premis.Desc)%>%
summarize(n=n())%>%
arrange(-n) %>%
kable()
| Premis.Desc | n |
|---|---|
| SINGLE FAMILY DWELLING | 457 |
| MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC) | 429 |
| STREET | 199 |
| SIDEWALK | 61 |
| PARKING LOT | 42 |
| HOTEL | 23 |
| VEHICLE, PASSENGER/TRUCK | 22 |
| MOTEL | 17 |
| PARK/PLAYGROUND | 12 |
| ALLEY | 11 |
| DRIVEWAY | 10 |
| OTHER RESIDENCE | 10 |
| OTHER BUSINESS | 7 |
| MOBILE HOME/TRAILERS/CONSTRUCTION TRAILERS/RV’S/MOTORHOME | 6 |
| GAS STATION | 5 |
| SINGLE RESIDENCE OCCUPANCY (SRO’S) LOCATIONS | 5 |
| TRANSIENT ENCAMPMENT | 5 |
| YARD (RESIDENTIAL/BUSINESS) | 5 |
| ABANDONED BUILDING ABANDONED HOUSE | 3 |
| GARAGE/CARPORT | 3 |
| PARKING UNDERGROUND/BUILDING | 3 |
| FREEWAY | 2 |
| NURSING/CONVALESCENT/RETIREMENT HOME | 2 |
| OTHER PREMISE | 2 |
| RESTAURANT/FAST FOOD | 2 |
| TRAIN TRACKS | 2 |
| TRANSITIONAL HOUSING/HALFWAY HOUSE | 2 |
| BEACH | 1 |
| BEAUTY/BARBER SHOP | 1 |
| CAR WASH | 1 |
| ELEMENTARY SCHOOL | 1 |
| FIRE STATION | 1 |
| GROUP HOME | 1 |
| HOSPITAL | 1 |
| LAUNDROMAT | 1 |
| MINI-MART | 1 |
| MISSIONS/SHELTERS | 1 |
| PROJECT/TENEMENT/PUBLIC HOUSING | 1 |
| SHORT-TERM VACATION RENTAL | 1 |
| STUDIO (FILM/PHOTOGRAPHIC/MUSIC) | 1 |
| TATTOO PARLOR* | 1 |
| TRANSPORTATION FACILITY (AIRPORT) | 1 |
| UNDERPASS/BRIDGE* | 1 |
Let’s select “SINGLE FAMILY DWELLING” and “MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)” so we can be confident the incidents are domestic. We use “filter” to subset the data by rows in this way.
IPV_subset <- filter(IPV_data, Premis.Desc =="SINGLE FAMILY DWELLING" | Premis.Desc == "MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)")
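When the list of premises grows, chaining `|` conditions gets unwieldy; `%in%` expresses the same test more compactly. A base-R sketch of the logic (the vectors here are made up for illustration):

```r
keep <- c("SINGLE FAMILY DWELLING",
          "MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)")
premises <- c("SINGLE FAMILY DWELLING", "STREET", "HOTEL")
premises %in% keep  # TRUE FALSE FALSE
# equivalent filter call: filter(IPV_data, Premis.Desc %in% keep)
```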
Looks good!
Let’s create a heat map by day and month. We start by making a table of weekday and month. Then we store the table results in a data frame so we can manipulate the variables a bit; for example, I want the days in a fixed order, Mon–Sun. When a table is stored this way, the variables get default names (Var1, Var2). I changed them back to “Weekday” and “Month” and used ggplot with the geom geom_tile to visualize the results.
table(IPV_subset$weekday, IPV_subset$month)
##
## 1 2 3 4 5 6
## Fri 18 19 19 20 19 21
## Mon 28 15 28 28 23 17
## Sat 23 14 21 23 30 17
## Sun 25 28 24 25 23 25
## Thu 23 14 17 30 18 16
## Tue 19 15 31 15 17 19
## Wed 22 21 26 15 19 16
DayHourCount <- as.data.frame(table(IPV_subset$weekday, IPV_subset$month))
DayHourCount$Var1 <- factor(DayHourCount$Var1, ordered = T, levels = c("Mon", "Tue", "Wed", "Thu","Fri", "Sat","Sun"))
colnames(DayHourCount)[1] <- "Weekday"
colnames(DayHourCount)[2] <- "Month"
ggplot(DayHourCount, aes(x = Weekday, y = Month)) +
geom_tile(aes(fill = Freq))
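The `factor(..., levels = ...)` call above is what forces the Mon–Sun ordering on the axis; without explicit levels, factors sort alphabetically. A quick check:

```r
days <- c("Sun", "Mon", "Sat")
levels(factor(days))                                   # alphabetical: "Mon" "Sat" "Sun"
levels(factor(days, levels = c("Mon", "Sat", "Sun")))  # the order we chose
```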
Now, let’s select a few columns to make the data more manageable, and then look at some summary statistics.
IPV_subset_columns <- select(IPV_subset, formatted_newdate, TIME.OCC, day, month, time, Vict.Age, Vict.Sex, Vict.Descent, Mocodes, LAT, LON, Status.Desc)
Nice. Let’s create grouped summaries now. We do that with the group_by function in dplyr. When you use dplyr commands on a grouped data set, they are automatically applied within each group. This example also introduces the “pipe” operator, which lets you chain multiple operations together.
library(dplyr)
IPV_subset_columns %>%
group_by(Vict.Sex) %>%
summarize(mean_age = mean(Vict.Age, na.rm = TRUE)) %>%
kable()
| Vict.Sex | mean_age |
|---|---|
| F | 32.63636 |
| M | 39.91593 |
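The pipe simply passes the left-hand side as the first argument of the next function, so `x %>% f(y)` is the same as `f(x, y)`. R (4.1+) also ships a native pipe `|>` that behaves the same way for simple calls; a sketch with toy numbers:

```r
# x |> f() is equivalent to f(x); dplyr's %>% works the same way
res_piped  <- c(1, 4, 9) |> sqrt() |> sum()
res_nested <- sum(sqrt(c(1, 4, 9)))
res_piped == res_nested  # TRUE
```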
Here is another example:
IPV_subset_columns %>%
group_by(Vict.Sex) %>%
filter(Vict.Descent =="W") %>%
summarize(
count = n(), mean_age = mean(Vict.Age, na.rm=TRUE)
)
## # A tibble: 2 x 3
## Vict.Sex count mean_age
## <fct> <int> <dbl>
## 1 F 113 34.9
## 2 M 37 40.8
And another… what does this code do?
IPV_subset_columns %>%
group_by(Vict.Sex, Vict.Descent) %>%
summarize(
count = n(), mean_age = mean(Vict.Age, na.rm=TRUE)
)
## # A tibble: 13 x 4
## # Groups: Vict.Sex [2]
## Vict.Sex Vict.Descent count mean_age
## <fct> <fct> <int> <dbl>
## 1 F A 15 28.7
## 2 F B 211 32.8
## 3 F F 1 48
## 4 F H 289 31.5
## 5 F K 2 35
## 6 F O 29 35.3
## 7 F W 113 34.9
## 8 M A 4 40
## 9 M B 86 42.8
## 10 M H 91 37.4
## 11 M K 1 30
## 12 M O 7 34.3
## 13 M W 37 40.8
Manipulating data in this way is useful for the visualization techniques below, such as the Sankey diagram. But first, let’s do another exercise.
Calculate the mean level of mental health needs for Black and White individuals by incident description (ldesc). Here is the data:
str(fake_data)
## 'data.frame': 299 obs. of 11 variables:
## $ ï..docno : int 391501 803678 606522 482018 966031 841411 568293 146773 627755 939000 ...
## $ ethnic_cd : Factor w/ 8 levels "A","B","H","I",..: 1 6 4 6 1 2 8 1 2 2 ...
## $ highschool_ged : Factor w/ 4 levels "G","H","M","N": 2 4 1 2 4 4 3 3 1 4 ...
## $ mhneeds_lvl : int 3 4 3 3 1 1 4 5 4 4 ...
## $ subs_lvl_values: int 1 5 3 4 4 2 3 2 3 2 ...
## $ off_cat : int 31 31 31 31 31 31 31 31 31 31 ...
## $ off_mdesc : Factor w/ 5 levels "ABUSE OF PUBLIC OFFICE-MISDEMEANOR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ incident_cod : Factor w/ 5 levels "AC","ES","PD",..: 1 1 4 3 3 1 1 4 3 1 ...
## $ ldesc : Factor w/ 5 levels "ACT OF SABOTAGE",..: 1 1 4 3 3 1 1 4 3 1 ...
## $ fac_cd : Factor w/ 3 levels "DC","SF","YP": 1 3 1 3 3 3 2 3 3 2 ...
## $ lu_cd : Factor w/ 22 levels "1","1A","2","21",..: 13 18 13 22 20 21 5 22 19 1 ...
fake_data %>%
group_by(XXXX, XXXX) %>%
filter(XXXX =="W" | XXXX == "B") %>%
dplyr::summarize(
mental_health_level = mean(mhneeds_lvl, na.rm=TRUE)
)
fake_data %>%
group_by(ldesc, ethnic_cd) %>%
filter(ethnic_cd =="W" | ethnic_cd == "B") %>%
dplyr::summarize(
mental_health_level = mean(mhneeds_lvl, na.rm=TRUE)
)
## # A tibble: 10 x 3
## # Groups: ldesc [5]
## ldesc ethnic_cd mental_health_level
## <fct> <fct> <dbl>
## 1 ACT OF SABOTAGE B 2.75
## 2 ACT OF SABOTAGE W 3.29
## 3 ESCAPE W/ FORCE B 2.67
## 4 ESCAPE W/ FORCE W 3.33
## 5 PROPERTY DAMAGE B 3.17
## 6 PROPERTY DAMAGE W 3.29
## 7 SEXUAL ACT B 3.11
## 8 SEXUAL ACT W 1.8
## 9 USE OF FORCE B 1.83
## 10 USE OF FORCE W 3.4
A few things to note:
The Sankey diagram is in a different order from the data in the table: Sankey automatically orders the categories to minimize the amount of overlap between rows. It also automatically merges nodes (blocks) that have the same values, so where two rows in the table contain the same information, they are combined.
Click here for a discussion of how to interpret these diagrams.
Sankey_data <-IPV_subset %>%
select(AREA.NAME, Vict.Sex, Vict.Descent, Status.Desc) %>%
group_by(AREA.NAME, Vict.Sex, Vict.Descent, Status.Desc) %>%
dplyr::summarize(
count = n())
Let’s look at the resulting data
Sankey_data
## # A tibble: 291 x 5
## # Groups: AREA.NAME, Vict.Sex, Vict.Descent [148]
## AREA.NAME Vict.Sex Vict.Descent Status.Desc count
## <fct> <fct> <fct> <fct> <int>
## 1 77th Street F B Adult Arrest 10
## 2 77th Street F B Adult Other 13
## 3 77th Street F B Invest Cont 36
## 4 77th Street F H Adult Arrest 7
## 5 77th Street F H Adult Other 4
## 6 77th Street F H Invest Cont 20
## 7 77th Street F W Adult Other 2
## 8 77th Street F W Invest Cont 3
## 9 77th Street M B Adult Arrest 5
## 10 77th Street M B Adult Other 5
## # ... with 281 more rows
library(ggalluvial)
ggplot(data = Sankey_data,
aes(axis1 = AREA.NAME, axis2 = Vict.Sex, axis3 = Vict.Descent,
y = count)) +
scale_x_discrete(limits = c("Area", "Sex", "Race/Ethnicity"), expand = c(.1, .05)) +
xlab("Demographic") +
geom_alluvium(aes(fill = Status.Desc)) +
geom_stratum() + geom_text(stat = "stratum", infer.label = TRUE) +
theme_minimal() +
ggtitle("Investigation Status of IPV Incidents",
"Stratified by Victim demographics and Community") +
theme(legend.position = 'bottom')
You will notice that the diagram needs to be cleaned up a bit. First, there are too many neighborhood areas represented, and the race/ethnicity variable has too few cases for some of its categories. Let’s select six neighborhood areas and redraw the chart.
Sankey_data <-IPV_subset %>%
select(AREA.NAME, Vict.Sex, Vict.Descent, Status.Desc) %>%
group_by(AREA.NAME, Vict.Sex, Vict.Descent, Status.Desc) %>%
filter(AREA.NAME == "77th Street" | AREA.NAME == "N Hollywood" | AREA.NAME == "Southeast"| AREA.NAME == "Southwest"| AREA.NAME == "Rampart"| AREA.NAME == "Olympic") %>%
dplyr::summarize(
count = n())
ggplot(data = Sankey_data,
aes(axis1 = AREA.NAME, axis2 = Vict.Sex, axis3 = Vict.Descent,
y = count)) +
scale_x_discrete(limits = c("Area", "Sex", "Race/Ethnicity"), expand = c(.1, .05)) +
xlab("Demographic") +
geom_alluvium(aes(fill = Status.Desc)) +
geom_stratum() + geom_text(stat = "stratum", infer.label = TRUE) +
theme_minimal() +
ggtitle("Investigation Status of IPV Incidents",
"Stratified by Victim demographics and Community") +
theme(legend.position = 'bottom')
In this exercise you will create a Sankey diagram using the crime category description instead of crime area, along with the demographic characteristics gender and race/ethnicity. First, create a data set called Sankey_data as we did above and select the relevant variables. For this exercise, only select Blacks, Hispanics, and Whites. Use the dplyr commands to format the data in the appropriate way. Follow the steps below:
Sankey_data <-IPV_subset %>%
select(XXXX, Vict.Sex, Vict.Descent, Status.Desc) %>%
group_by(XXXX, Vict.Sex, Vict.Descent, Status.Desc) %>%
filter(XXXX) %>%
dplyr::summarize(
count = n())
ggplot(data = Sankey_data,
aes(axis1 = XXXX, axis2 = XXXX, axis3 = XXXX,
y = count)) +
scale_x_discrete(limits = c(XXXX), expand = c(.1, .05)) +
xlab("Demographic") +
geom_alluvium(aes(fill = XXXX)) +
geom_stratum() + geom_text(stat = "stratum", infer.label = TRUE) +
theme_minimal() +
ggtitle("Investigation Status of IPV Incidents",
"Stratified by Victim demographics and Place of Injury") +
theme(legend.position = 'bottom')
Sankey_data <-IPV_subset %>%
dplyr::select(Premis.Desc, Vict.Sex, Vict.Descent, Status.Desc) %>%
dplyr::mutate(Premis.Desc = str_replace(Premis.Desc, ".*MULTI-UNIT.*", "Apartment")) %>%
dplyr::mutate(Premis.Desc = str_replace(Premis.Desc, ".*SINGLE.*", "Family Home")) %>%
dplyr::group_by(Premis.Desc, Vict.Sex, Vict.Descent, Status.Desc) %>%
dplyr::filter(Vict.Descent == "B" | Vict.Descent == "H" | Vict.Descent == "W") %>%
dplyr::summarize(
count = n())
ggplot(data = Sankey_data,
aes(axis1 = Premis.Desc, axis2 = Vict.Sex, axis3 = Vict.Descent,
y = count)) +
scale_x_discrete(limits = c("Area", "Sex", "Race/Ethnicity"), expand = c(.1, .05)) +
xlab("Demographic") +
geom_alluvium(aes(fill = Premis.Desc)) +
geom_stratum() +
geom_text(stat = "stratum", infer.label = TRUE) +
theme_minimal() +
ggtitle("Investigation Status of IPV Incidents",
"Stratified by Victim demographics and Place of Injury") +
theme(legend.position = 'bottom')
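The `str_replace()` calls above use the `.*PATTERN.*` idiom: because `.*` matches any surrounding text, any label containing the pattern is collapsed into a single clean category. Base R's `sub()` behaves the same way (a sketch with labels taken from the premises table above):

```r
labels <- c("MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",
            "SINGLE FAMILY DWELLING")
# ".*MULTI-UNIT.*" matches the entire string, so the whole label is replaced
sub(".*MULTI-UNIT.*", "Apartment", labels)
# "Apartment" "SINGLE FAMILY DWELLING"
```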
How do we get from X to the committed offense?
Sankey_data <-fake_data %>%
select(highschool_ged, ethnic_cd, mhneeds_lvl, off_mdesc) %>%
mutate(off_mdesc = str_replace(off_mdesc, ".*ABUSE .*", "Misdemeanor")) %>%
mutate(off_mdesc = str_replace(off_mdesc, ".*HOMICIDE-.*", "Homicide")) %>%
mutate(off_mdesc = str_replace(off_mdesc, ".*FINANCIAL .*", "Financial")) %>%
mutate(off_mdesc = str_replace(off_mdesc, ".*CRIME .*", "Crime Act")) %>%
mutate(off_mdesc = str_replace(off_mdesc, ".*CRIMINAL .*", "Criminal Attempt")) %>%
group_by(highschool_ged, ethnic_cd, mhneeds_lvl, off_mdesc) %>%
filter(ethnic_cd == "B" | ethnic_cd == "H" | ethnic_cd == "W") %>%
dplyr::summarize(
count = n())
ggplot(data = Sankey_data,
aes(axis1 = highschool_ged, axis2 = ethnic_cd, axis3 = mhneeds_lvl,
y = count)) +
scale_x_discrete(limits = c("GED Status", "Race/Ethnicity", "Mental Health"), expand = c(.1, .05)) +
xlab("Demographic") +
geom_alluvium(aes(fill = off_mdesc)) +
geom_stratum() +
geom_text(stat = "stratum", infer.label = TRUE) +
theme_minimal() +
ggtitle("Investigation Status of IPV Incidents",
"Stratified by Victim demographics and Place of Injury") +
theme(legend.position = 'bottom')
Sometimes we would rather connect to the data directly from the provider’s website. These providers make their data available through an API (application programming interface). Some websites require you to register and obtain a token prior to accessing the API. This is a powerful way to connect to data, as the following examples demonstrate.
The R library that allows you to connect to, and query, such a website is called RSocrata. This is like reading a csv file directly from a website; however, RSocrata allows you to query the data (i.e., the big data) before you download it! You do this by passing the query string directly into the URL, as follows.
Let’s open the LA city crime data by accessing it directly from the city of LA’s open data portal. Note, this will take a few seconds because it’s a very large file.
library(RSocrata)
base_url = "https://data.lacity.org/resource/2nrs-mtv8.json?" #this is 2020 to present
my_token <- "w0BkWUPZYzjQRwNEVX8KEijw4"
lacity_data <- read.socrata(base_url, my_token)
glimpse(lacity_data)
## Observations: 114,582
## Variables: 28
## $ dr_no <chr> "010304468", "190101086", "191501505", "1919212...
## $ date_rptd <dttm> 2020-01-08, 2020-01-02, 2020-01-01, 2020-01-01...
## $ date_occ <dttm> 2020-01-08, 2020-01-01, 2020-01-01, 2020-01-01...
## $ time_occ <chr> "2230", "0330", "1730", "0415", "0030", "1315",...
## $ area <chr> "03", "01", "15", "19", "01", "01", "01", "01",...
## $ area_name <chr> "Southwest", "Central", "N Hollywood", "Mission...
## $ rpt_dist_no <chr> "0377", "0163", "1543", "1998", "0163", "0161",...
## $ part_1_2 <chr> "2", "2", "2", "2", "1", "1", "2", "1", "1", "2...
## $ crm_cd <chr> "624", "624", "745", "740", "121", "442", "946"...
## $ crm_cd_desc <chr> "BATTERY - SIMPLE ASSAULT", "BATTERY - SIMPLE A...
## $ mocodes <chr> "0444 0913", "0416 1822 1414", "0329 1402", "03...
## $ vict_age <chr> "36", "25", "76", "31", "25", "23", "0", "23", ...
## $ vict_sex <chr> "F", "M", "F", "X", "F", "M", "X", "M", "M", "M...
## $ vict_descent <chr> "B", "H", "W", "X", "H", "H", "X", "B", "A", "O...
## $ premis_cd <chr> "501", "102", "502", "409", "735", "404", "726"...
## $ premis_desc <chr> "SINGLE FAMILY DWELLING", "SIDEWALK", "MULTI-UN...
## $ weapon_used_cd <chr> "400", "500", NA, NA, "500", NA, NA, NA, "306",...
## $ weapon_desc <chr> "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)...
## $ status <chr> "AO", "IC", "IC", "IC", "IC", "IC", "IC", "IC",...
## $ status_desc <chr> "Adult Other", "Invest Cont", "Invest Cont", "I...
## $ crm_cd_1 <chr> "624", "624", "745", "740", "121", "442", "946"...
## $ location <chr> "1100 W 39TH PL", "700...
## $ lat <chr> "34.0141", "34.0459", "34.1685", "34.2198", "34...
## $ lon <chr> "-118.2978", "-118.2545", "-118.4019", "-118.44...
## $ crm_cd_2 <chr> NA, NA, "998", NA, "998", "998", "998", "998", ...
## $ cross_street <chr> NA, NA, NA, NA, NA, NA, NA, NA, "OLIVE", NA, NA...
## $ crm_cd_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crm_cd_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
Notice that all of the crimes committed were downloaded: over 114,000 records, and this is only data from January 1, 2020 to the present. Notice also that the variable names are lower case, which matters because R is case sensitive.
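The case-sensitivity point is easy to demonstrate with a toy data frame (hypothetical, not the downloaded data):

```r
# R is case sensitive: d$area_name and d$Area_Name are different names
d <- data.frame(area_name = c("Central", "Hollywood"))
is.null(d$Area_Name)   # TRUE  -- wrong case returns NULL, not the column
is.null(d$area_name)   # FALSE -- the lower-case name matches
```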
I can query the data by creating a SoQL (Socrata Query Language) statement and embedding it in the URL. Depending on the type of data you have, useful query operators include:
* is
* between
* is not
* starts with
* contains
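As a sketch of how a filter slots into the URL, the clause is simply pasted onto the resource address. The column names (area_name, vict_sex) come from the dataset above; the filter values 'Hollywood' and 'F' are arbitrary examples:

```r
base_url <- "https://data.lacity.org/resource/2nrs-mtv8.json?"
where_clause <- "$where=area_name = 'Hollywood' AND vict_sex = 'F'"
(query_url <- paste0(base_url, where_clause))
## [1] "https://data.lacity.org/resource/2nrs-mtv8.json?$where=area_name = 'Hollywood' AND vict_sex = 'F'"
```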
Let’s create the string and query the data between June 1, 2020 and the present. The trick for the present date is to grab the computer’s system date with Sys.Date(), which returns today’s date.
Sys.Date() # Today's date
## [1] "2020-08-10"
(query_string = paste("'", Sys.Date(),"'", sep="")) # The query string has to be in a particular format, with the date surrounded by single quotes
## [1] "'2020-08-10'"
(base_url = paste("https://data.lacity.org/resource/2nrs-mtv8.json?$where=date_occ between '2020-06-01' and", query_string)) #this is 2020 to present
## [1] "https://data.lacity.org/resource/2nrs-mtv8.json?$where=date_occ between '2020-06-01' and '2020-08-10'"
Paste the URL above into your browser to preview the data that will be returned. If the query is malformed, you will get an error instead of data.
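Note that the query string contains spaces. Your browser (and read.socrata()) percent-encodes them before sending the request; base R’s URLencode() shows what actually goes over the wire:

```r
# Spaces in the query string are transmitted as %20
URLencode("$where=date_occ between '2020-06-01' and '2020-08-10'")
```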
Note: most Socrata-hosted open data portals follow this same URL format.
Now we can connect to the data as before.
lacity_data <- read.socrata(base_url, my_token)
glimpse(lacity_data)
## Observations: 32,626
## Variables: 28
## $ dr_no <chr> "200610840", "201213939", "200709884", "2014116...
## $ date_rptd <dttm> 2020-06-01, 2020-06-01, 2020-06-01, 2020-06-01...
## $ date_occ <dttm> 2020-06-01, 2020-06-01, 2020-06-01, 2020-06-01...
## $ time_occ <chr> "0930", "1200", "1030", "0030", "0930", "0115",...
## $ area <chr> "06", "12", "07", "14", "20", "04", "02", "21",...
## $ area_name <chr> "Hollywood", "77th Street", "Wilshire", "Pacifi...
## $ rpt_dist_no <chr> "0646", "1259", "0711", "1488", "2033", "0497",...
## $ part_1_2 <chr> "1", "1", "1", "1", "1", "1", "1", "2", "1", "2...
## $ crm_cd <chr> "210", "510", "510", "510", "236", "310", "510"...
## $ crm_cd_desc <chr> "ROBBERY", "VEHICLE - STOLEN", "VEHICLE - STOLE...
## $ mocodes <chr> "0400 0432 0344 0411 2033", NA, NA, NA, "2000 1...
## $ vict_age <chr> "35", "0", "0", "0", "27", "0", "0", "25", "0",...
## $ vict_sex <chr> "M", NA, NA, NA, "F", "X", NA, "F", NA, "M", "M...
## $ vict_descent <chr> "W", NA, NA, NA, "H", "X", NA, "W", NA, "H", "W...
## $ premis_cd <chr> "102", "104", "707", "108", "108", "405", "101"...
## $ premis_desc <chr> "SIDEWALK", "DRIVEWAY", "GARAGE/CARPORT", "PARK...
## $ weapon_used_cd <chr> "500", NA, NA, NA, "512", NA, NA, NA, NA, "400"...
## $ weapon_desc <chr> "UNKNOWN WEAPON/OTHER WEAPON", NA, NA, NA, "MAC...
## $ status <chr> "IC", "IC", "IC", "IC", "IC", "IC", "IC", "IC",...
## $ status_desc <chr> "Invest Cont", "Invest Cont", "Invest Cont", "I...
## $ crm_cd_1 <chr> "210", "510", "510", "510", "236", "310", "510"...
## $ location <chr> "SELMA", "700 E 76TH P...
## $ cross_street <chr> "VINE", NA, NA, NA, "7TH ...
## $ lat <chr> "34.0998", "33.9703", "34.0782", "33.9553", "34...
## $ lon <chr> "-118.3267", "-118.263", "-118.3732", "-118.385...
## $ crm_cd_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "99...
## $ crm_cd_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ crm_cd_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
Create a query string that selects only the cases where the variable vict_sex is ‘F’ (female).
base_url = "https://data.lacity.org/resource/2nrs-mtv8.json?$where=vict_sex = 'F'"
lacity_data_females <- read.socrata(base_url, my_token)
glimpse(lacity_data_females)
Let’s make a wordcloud of the area names in the full data set.
library(RColorBrewer)
library(wordcloud)
pal = brewer.pal(9,"Blues")
area_counts <- as.data.frame(table(lacity_data$area_name)) # frequency table of area names
colnames(area_counts) <- c("Area_Name", "Count")
wordcloud(area_counts$Area_Name, area_counts$Count, min.freq = 100, max.words = 15, random.order = T, random.color = T, colors =c("green", "cornflowerblue", "darkred"), scale = c(2,.3))
library(treemap)
## Warning: package 'treemap' was built under R version 3.6.3
treemap_df <-
lacity_data %>%
dplyr::filter(str_detect(crm_cd_desc, "CHILD NEGLECT|CHILD ABUSE|INTIMATE PARTNER")) %>%
dplyr::group_by(area_name, crm_cd_desc) %>%
dplyr::summarize(n = n())
treemap(treemap_df,
index=c("area_name","crm_cd_desc"),
vSize="n",
type="index",
fontsize.labels=c(15,12),
fontcolor.labels=c("white","orange"),
fontface.labels=c(2,1),
bg.labels=c("transparent"),
align.labels=list(
c("center", "center"),
c("center", "top")
),
overlap.labels=0.2,
inflate.labels=T
)
treemap_df <-
fake_data %>%
dplyr::group_by(ethnic_cd, ldesc) %>%
dplyr::summarize(n = n())
treemap(treemap_df,
index=c("ethnic_cd","ldesc"),
vSize="n",
type="index",
fontsize.labels=c(15,12),
fontcolor.labels=c("white","orange"),
fontface.labels=c(2,1),
bg.labels=c("transparent"),
align.labels=list(
c("center", "center"),
c("center", "top")
),
overlap.labels=0.2,
inflate.labels=F
)
Let’s take a brief detour to show how to convert the MO (modus operandi) codes into text, and then introduce the idea of mapping.
ipv_crimes_in_la <- lacity_data %>%
mutate(vict_age = as.numeric(vict_age)) %>%
filter(str_detect(crm_cd_desc, "INTIMATE PARTNER"),
str_detect(premis_desc, "MOTORHOME|GROUP HOME|MOTEL|DWELLING|RESIDENTIAL|HOUSING")) %>%
mutate(crm_cd_desc = str_replace(crm_cd_desc, ".*INTIMATE PARTNER.*", "IPV"))%>%
select("dr_no", "crm_cd_desc", "date_occ", "time_occ", "lat", "lon", "mocodes","area_name", "vict_age", "vict_sex", "vict_descent", "premis_desc", "weapon_desc", "status_desc")
glimpse(ipv_crimes_in_la)
## Observations: 1,607
## Variables: 14
## $ dr_no <chr> "200910349", "201812154", "202110215", "201214247...
## $ crm_cd_desc <chr> "IPV", "IPV", "IPV", "IPV", "IPV", "IPV", "IPV", ...
## $ date_occ <dttm> 2020-06-01, 2020-06-04, 2020-06-05, 2020-06-06, ...
## $ time_occ <chr> "0105", "1027", "1700", "2000", "1530", "0230", "...
## $ lat <chr> "34.1576", "33.9274", "34.2071", "33.9821", "34.2...
## $ lon <chr> "-118.4074", "-118.2871", "-118.5754", "-118.3318...
## $ mocodes <chr> "2000 0416 1222", "2000 2002 0329", "2000 1241 04...
## $ area_name <chr> "Van Nuys", "Southeast", "Topanga", "77th Street"...
## $ vict_age <dbl> 29, 32, 31, 32, 47, 38, 73, 46, 36, 29, 58, 22, 3...
## $ vict_sex <chr> "M", "F", "F", "F", "F", "F", "F", "F", "M", "F",...
## $ vict_descent <chr> "H", "B", "H", "B", "B", "B", "O", "B", "H", "H",...
## $ premis_desc <chr> "MOTEL", "SINGLE FAMILY DWELLING", "MULTI-UNIT DW...
## $ weapon_desc <chr> "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",...
## $ status_desc <chr> "Invest Cont", "Invest Cont", "Adult Arrest", "In...
crimebyarea <- ipv_crimes_in_la %>%
dplyr::group_by(crm_cd_desc, area_name) %>%
dplyr::summarise(count= n())
crimebyarea
## # A tibble: 21 x 3
## # Groups: crm_cd_desc [1]
## crm_cd_desc area_name count
## <chr> <chr> <int>
## 1 IPV 77th Street 172
## 2 IPV Central 56
## 3 IPV Devonshire 46
## 4 IPV Foothill 57
## 5 IPV Harbor 108
## 6 IPV Hollenbeck 49
## 7 IPV Hollywood 54
## 8 IPV Mission 98
## 9 IPV N Hollywood 70
## 10 IPV Newton 74
## # ... with 11 more rows
ipv_crimes_in_la$mo <- ipv_crimes_in_la$mocodes # keep a copy; separate() will consume the mocodes column
ipv_crime_mo <- separate(data = ipv_crimes_in_la, col = mocodes, into =
c("m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"),
sep = " ")
## Warning: Expected 10 pieces. Missing pieces filled with `NA` in 1550
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
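To see what separate() is doing here, the same split can be run on a single mocodes value (one of those shown above) with base R’s strsplit():

```r
mo <- "2000 0416 1222"            # one mocodes value from the data above
(codes <- strsplit(mo, " ")[[1]]) # split on the space delimiter
## [1] "2000" "0416" "1222"
as.numeric(codes)                 # note that "0416" loses its leading zero
## [1] 2000  416 1222
```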
#makes all "m" variables numeric at once
ipv_crime_mo[,7:16] <- sapply(ipv_crime_mo[,7:16],as.numeric)
glimpse(ipv_crime_mo)
## Observations: 1,607
## Variables: 24
## $ dr_no <chr> "200910349", "201812154", "202110215", "201214247...
## $ crm_cd_desc <chr> "IPV", "IPV", "IPV", "IPV", "IPV", "IPV", "IPV", ...
## $ date_occ <dttm> 2020-06-01, 2020-06-04, 2020-06-05, 2020-06-06, ...
## $ time_occ <chr> "0105", "1027", "1700", "2000", "1530", "0230", "...
## $ lat <chr> "34.1576", "33.9274", "34.2071", "33.9821", "34.2...
## $ lon <chr> "-118.4074", "-118.2871", "-118.5754", "-118.3318...
## $ m1 <dbl> 2000, 2000, 2000, 2000, 2000, 421, 416, 2000, 124...
## $ m2 <dbl> 416, 2002, 1241, 1814, 1814, 913, 1202, 913, 381,...
## $ m3 <dbl> 1222, 329, 448, 913, 913, 2000, 2000, 1814, 1813,...
## $ m4 <dbl> NA, NA, 2002, 432, 416, 1814, 1241, 448, 2000, 44...
## $ m5 <dbl> NA, NA, 444, 444, NA, 416, 913, 444, 913, 417, 41...
## $ m6 <dbl> NA, NA, NA, 429, NA, NA, NA, 1813, 416, NA, 2002,...
## $ m7 <dbl> NA, NA, NA, 448, NA, NA, NA, 1309, 444, NA, NA, N...
## $ m8 <dbl> NA, NA, NA, 216, NA, NA, NA, NA, 446, NA, NA, NA,...
## $ m9 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 1821, NA, NA, NA,...
## $ m10 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ area_name <chr> "Van Nuys", "Southeast", "Topanga", "77th Street"...
## $ vict_age <dbl> 29, 32, 31, 32, 47, 38, 73, 46, 36, 29, 58, 22, 3...
## $ vict_sex <chr> "M", "F", "F", "F", "F", "F", "F", "F", "M", "F",...
## $ vict_descent <chr> "H", "B", "H", "B", "B", "B", "O", "B", "H", "H",...
## $ premis_desc <chr> "MOTEL", "SINGLE FAMILY DWELLING", "MULTI-UNIT DW...
## $ weapon_desc <chr> "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",...
## $ status_desc <chr> "Invest Cont", "Invest Cont", "Adult Arrest", "In...
## $ mo <chr> "2000 0416 1222", "2000 2002 0329", "2000 1241 04...
tbl_lookup<-read.csv("D:/Barboza Law/COC presentation/Day 2/MO_CODES_Numerical_20180627.csv")
names(tbl_lookup)[1] <- "id"
# For each of the ten MO columns (m1-m10, in columns 7-16), look up the numeric
# code in tbl_lookup$id and append the matching text description as a new column
for (i in 1:10){
ipv_crime_mo[,(length(ipv_crime_mo)+1)] = tbl_lookup[match(ipv_crime_mo[,(i+6)], tbl_lookup$id), "descript"]
}
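Since the loop is dense, here is a toy version of the match() lookup it performs. The codes and descriptions below are invented for illustration and are not the real MO code table:

```r
lookup <- data.frame(id       = c(2000, 416, 1222),
                     descript = c("Domestic violence", "Hit with weapon", "Brandished"),
                     stringsAsFactors = FALSE)
codes <- c(416, 2000, 9999)  # 9999 has no entry in the lookup table
# match() gives each code's row position in lookup (NA if absent);
# that position then indexes the descript column
lookup[match(codes, lookup$id), "descript"]
# -> "Hit with weapon", "Domestic violence", NA
```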
library(writexl)
## Warning: package 'writexl' was built under R version 3.6.3
write_xlsx(ipv_crime_mo, "ipv_crime_mo.xlsx")
We now have all IPV incidents from June 1, 2020 to the present (whatever that date happens to be when you run the code). Let’s illustrate the power of data mining by returning all incidents whose modus operandi codes indicate the partners were in a same-sex relationship.
homosexual_data <- ipv_crime_mo %>%
filter_at(.vars = vars(V25, V26, V27, V28, V29, V30, V31, V32, V33, V34),
.vars_predicate = any_vars(str_detect(. , "Homo")))
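filter_at() with any_vars() keeps a row when any of the listed columns matches the pattern. The same row-wise logic can be sketched in base R on a hypothetical two-column data frame:

```r
df <- data.frame(m1 = c("Homosexual", "Stranger",   NA),
                 m2 = c(NA,           "Homosexual", "Neighbor"),
                 stringsAsFactors = FALSE)
# grepl() returns FALSE for NA cells, so any() is safe with missing values
keep <- apply(df, 1, function(row) any(grepl("Homo", row)))
df[keep, ]  # rows 1 and 2 survive; row 3 has no match
```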
Next, let’s start to explore how to cluster crime across space (and time). You may have noticed that this dataset contains the latitude (Y) and longitude (X) associated with each crime incident. Let’s map the crimes we downloaded and see whether any patterns emerge.
R has a number of powerful GIS-related functions. We will use several of them, from the packages loaded below, to map our data.
library(rgdal)
## Loading required package: sp
##
## Attaching package: 'sp'
## The following object is masked from 'package:ggraph':
##
## geometry
## rgdal: version: 1.4-3, (SVN revision 828)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
## Path to GDAL shared files: C:/Users/g.barboza/Documents/R/win-library/3.6/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: C:/Users/g.barboza/Documents/R/win-library/3.6/rgdal/proj
## Linking to sp version: 1.3-1
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(sp)
library(spdep)
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(tmap)
## Warning: package 'tmap' was built under R version 3.6.1
library(tigris)
## Warning: package 'tigris' was built under R version 3.6.1
## To enable
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
##
## Attaching package: 'tigris'
## The following object is masked from 'package:igraph':
##
## blocks
## The following object is masked from 'package:graphics':
##
## plot
In order to draw the boundary of our region, we need a shapefile. I have already downloaded a shapefile of the City of LA, which also contains data on COVID-19 cases across the city.
Open the shapefile.
lacity_bdry <- st_read("D:/Barboza Law/COC presentation/Day 2/shapefiles/covid-19 by neighborhood/nghbrhd_data.shp")%>%
st_transform(4269)
## Reading layer `nghbrhd_data' from data source `D:\Barboza Law\COC presentation\Day 2\shapefiles\covid-19 by neighborhood\nghbrhd_data.shp' using driver `ESRI Shapefile'
## Simple feature collection with 127 features and 8 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 6359577 ymin: 1714640 xmax: 6514629 ymax: 1945542
## epsg (SRID): NA
## proj4string: +proj=lcc +lat_1=34.03333333333333 +lat_2=35.46666666666667 +lat_0=33.5 +lon_0=-118 +x_0=2000000 +y_0=500000.0000000002 +datum=NAD83 +units=us-ft +no_defs
plot(lacity_bdry)
In order to map the crime data onto the city boundary, we first have to convert it into a spatial object, which is really easy using the sf library in R. Here are the simple steps:
points_sf <- st_as_sf(homosexual_data, coords = c("lon", "lat"))
st_crs(points_sf) <- st_crs(lacity_bdry)
plot(lacity_bdry$geometry)
plot(points_sf$geometry, col = "dark red", lwd = 4, cex = .4, add = TRUE)
Cool. There is clearly a cluster near Rampart.
In this exercise, you will create a spatial point file of all incidents and overlay it onto the City of LA shapefile. Since there is more data, there is inevitably some cleaning to do. First, we want to keep only valid latitude and longitude values: look at the data and you will see a number of records with latitude and longitude of 0. Let’s remove them.
#ipv_crime_mo <- ipv_crime_mo[complete.cases(ipv_crime_mo), ]
ipv_crime_mo$lon <- as.numeric(as.character(ipv_crime_mo$lon))
ipv_crime_mo$lat <- as.numeric(as.character(ipv_crime_mo$lat))
ipv_crime_mo <- ipv_crime_mo %>% filter(lon<0)
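The logical filter on longitude is worth seeing in isolation. With made-up coordinates, only points with a negative longitude survive (all of LA is west of the prime meridian, so a 0 marks a missing geocode):

```r
lon <- c(-118.30, 0, -118.41)  # hypothetical values; 0 means no geocode
valid <- lon < 0
sum(valid)  # 2 of the 3 points are kept
```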
points_sf <- st_as_sf(ipv_crime_mo, coords = c(XXXX))
st_crs(XXXX) <- st_crs(lacity_bdry)
plot(lacity_bdry$geometry)
plot(XXXX, col = "dark red", lwd = 4, cex = .4, add = TRUE)
points_sf <- st_as_sf(ipv_crime_mo, coords = c("lon", "lat"))
st_crs(points_sf) <- st_crs(lacity_bdry)
plot(lacity_bdry$geometry)
plot(points_sf$geometry, col = "dark red", lwd = 2, cex = .2, add = TRUE)
merged_crimes <- st_join(lacity_bdry,points_sf) %>%
dplyr::group_by(COMTY_NAME) %>% # group by unique tract ID
dplyr::mutate(ipv_crimes = n()) %>%
dplyr::ungroup() %>%
dplyr::select(COMTY_NAME, ipv_crimes, geometry, count) %>%
dplyr::distinct(COMTY_NAME, ipv_crimes, geometry, count)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
tmap is similar to ggmap in that each input dataset can be “mapped” in a range of different ways, including location on the map (defined by the data’s geometry), color, and other visual variables. The basic building block is tm_shape() (which defines the input data, raster or vector objects), followed by one or more layer elements such as tm_fill() and tm_dots(). This layering is demonstrated in the chunk below.
The object passed to tm_shape() in this case is merged_crimes, an sf object holding the IPV incident counts for the neighborhoods in the City of LA boundary file. Layers are added to represent merged_crimes visually, with tm_fill() creating shaded areas and tm_borders() drawing the outlines.
# Add fill layer to the merged_crimes shape
tm_shape(merged_crimes) +
tm_fill()
# Add border layer to the merged_crimes shape
tm_shape(merged_crimes) +
tm_borders()
# Add fill and border layers to the merged_crimes shape
tm_shape(merged_crimes) +
tm_fill() +
tm_borders()
tm_shape(merged_crimes) +
tm_polygons()
tm_shape(merged_crimes) +
tm_fill(col = "ipv_crimes")
tm_shape(merged_crimes) +
tm_fill('ipv_crimes') +
tm_borders('gray72') +
tm_layout(legend.position = c('LEFT', 'bottom'),
panel.label.size = 1,
panel.label.color = 'black',
panel.label.bg.color = 'gainsboro',
panel.label.height = 1.25,
main.title = 'IPV Cases in Los Angeles',
main.title.size = 1.2)
tm_shape(merged_crimes) +
tm_polygons("count", title = "COVID-19 count", palette = "GnBu",
style = "kmeans",
legend.hist = T) +
tm_shape(merged_crimes) +
tm_bubbles("ipv_crimes", title.size = "Intimate Partner Violence", col = "gold")+
tm_scale_bar(width = 0.22, position = c("left", "bottom")) +
tm_compass(position = c("right", "bottom"))+
tm_layout(frame = F,
title = "City of Los Angeles",
title.size = 2,
legend.hist.size = 0.5,
legend.outside = T)
my_interactive_covid_dv_map<-tm_shape(merged_crimes) +
tm_polygons("count", title = "COVID-19 count", palette = "GnBu",
style = "kmeans",
legend.hist = T) +
tm_shape(merged_crimes) +
tm_bubbles("ipv_crimes", title.size = "Intimate Partner Violence", col = "gold")+
tm_scale_bar(width = 0.22, position = c("left", "bottom")) +
tm_compass(position = c("right", "bottom"))+
tm_layout(frame = F,
title = "City of Los Angeles",
title.size = 2,
legend.hist.size = 0.5,
legend.outside = T)
tmap_mode("view")
## tmap mode set to interactive viewing
tmap_leaflet(my_interactive_covid_dv_map)
## Compass not supported in view mode.
## Scale bar width set to 100 pixels
## Legend for symbol sizes not available in view mode.
tmap_mode(mode="plot")
## tmap mode set to plotting
tm_shape(merged_crimes) +
tm_polygons('#f0f0f0f0', border.alpha = 0.2) +
tm_shape(points_sf) +
tm_dots(size=.3, col="black")
One of the most powerful open data sources is census data. The tidycensus package provides functions that automatically connect to, download, and merge census data into your analyses. Let’s try that with the next few commands. We have data from Los Angeles, so we will download tract-level data for LA County and then intersect the shapefiles to return only the data from LA City!
In order to use these functions, you need to get a census API key (like the Socrata token above). This is easy to get. Once you have one, replace the key below with your own.
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 3.6.3
library(tidyverse)
options(tigris_use_cache = TRUE)
census_api_key("0f318090e26bc2bcdb1bda09a7ff3be3f421bf84", overwrite=TRUE)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
readRenviron("~/.Renviron")
la_county <- get_acs(state = "CA", county = "Los Angeles", geography = "tract",
variables = "B10052_002", geometry = TRUE)
## Getting data from the 2014-2018 5-year ACS
head(la_county)
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -118.3224 ymin: 34.23124 xmax: -118.2652 ymax: 34.27895
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## GEOID NAME
## 1 06037101110 Census Tract 1011.10, Los Angeles County, California
## 2 06037101122 Census Tract 1011.22, Los Angeles County, California
## 3 06037101210 Census Tract 1012.10, Los Angeles County, California
## 4 06037101220 Census Tract 1012.20, Los Angeles County, California
## 5 06037101300 Census Tract 1013, Los Angeles County, California
## 6 06037101400 Census Tract 1014, Los Angeles County, California
## variable estimate moe geometry
## 1 B10052_002 7 13 MULTIPOLYGON (((-118.3008 3...
## 2 B10052_002 28 25 MULTIPOLYGON (((-118.3032 3...
## 3 B10052_002 0 17 MULTIPOLYGON (((-118.2995 3...
## 4 B10052_002 38 39 MULTIPOLYGON (((-118.2859 3...
## 5 B10052_002 0 12 MULTIPOLYGON (((-118.2782 3...
## 6 B10052_002 10 15 MULTIPOLYGON (((-118.3224 3...
#v17 <- load_variables(2017, "acs5", cache = TRUE)
#View(v17)
la_county %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = NA) +
coord_sf(crs = 26911) +
scale_fill_viridis_c(option = "magma")
racevars <- c(White = "P005003",
Black = "P005004",
Asian = "P005006",
Hispanic = "P004003")
la_decennial <- get_decennial(geography = "tract", variables = racevars,
state = "CA", county = "Los Angeles County", geometry = TRUE,
summary_var = "P001001")
## Getting data from the 2010 decennial Census
head(la_decennial)
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -118.3224 ymin: 34.23124 xmax: -118.2653 ymax: 34.27895
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 6
## GEOID NAME variable value summary_value geometry
## <chr> <chr> <chr> <dbl> <dbl> <MULTIPOLYGON [°]>
## 1 06037~ Census ~ White 2656 4731 (((-118.2886 34.25591, -118~
## 2 06037~ Census ~ White 2437 3664 (((-118.2773 34.26196, -118~
## 3 06037~ Census ~ White 2890 5990 (((-118.2886 34.24861, -118~
## 4 06037~ Census ~ White 1662 3363 (((-118.278 34.24961, -118.~
## 5 06037~ Census ~ White 3190 4199 (((-118.2773 34.25991, -118~
## 6 06037~ Census ~ White 2649 3903 (((-118.3026 34.25239, -118~
la_decennial %>%
mutate(pct = 100 * (value / summary_value)) %>%
ggplot(aes(fill = pct)) +
facet_wrap(~variable) +
geom_sf(color = NA) +
coord_sf(crs = 26911) +
scale_fill_viridis_c()
st_crs(la_decennial) <- st_crs(lacity_bdry)
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform
## for that
plot(x_and_y <- st_intersection(la_decennial, lacity_bdry), col = "lightgrey", main="st_intersection")
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
## Warning: plotting the first 10 out of 13 attributes; use max.plot = 13 to
## plot all
x_and_y %>%
mutate(pct = 100 * (value / summary_value)) %>%
ggplot(aes(fill = pct)) +
facet_wrap(~variable) +
geom_sf(color = NA) +
coord_sf(crs = 26911) +
scale_fill_viridis_c()
tm_shape(x_and_y) +
tm_polygons("value", title = "race", palette = "GnBu",
style = "kmeans") +
tm_facets(by = "variable", drop.NA.facets = TRUE) +
tm_shape(points_sf) + tm_dots(size=.1, title.size = "Intimate Partner Violence", col = "gold")
In this final exercise, you will download census data for Denver County, Colorado, and map it.
options(tigris_use_cache = TRUE)
census_api_key("0f318090e26bc2bcdb1bda09a7ff3be3f421bf84", overwrite=TRUE)
readRenviron("~/.Renviron")
denver_county <- get_acs(state = "XXXX", county = "XXXX", geography = "XXXX",
variables = "XXXX", geometry = TRUE)
options(tigris_use_cache = TRUE)
census_api_key("0f318090e26bc2bcdb1bda09a7ff3be3f421bf84", overwrite=TRUE)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
readRenviron("~/.Renviron")
denver_county <- get_acs(state = "CO", county = "Denver County", geography = "tract",
variables = "B10052_002", geometry = TRUE)
## Getting data from the 2014-2018 5-year ACS
tm_shape(denver_county) +
tm_polygons("estimate", title = "Number of Disabled in Denver Co", palette = "GnBu",
style = "kmeans",
legend.hist = T) +
tm_shape(denver_county) +
tm_bubbles("moe", title.size = "Margin of Error", col = "gold")+
tm_scale_bar(width = 0.22, position = c("left", "top")) +
tm_compass(position = c("right", "bottom"))+
tm_layout(frame = F,
title = "Denver County",
title.size = 2,
legend.hist.size = 0.5,
legend.outside = T)